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Abstract 



We propose a logical/mathematical framework for statistical parameter learning of pa- 
rameterized logic programs, i.e. definite clause programs containing probabilistic facts with 
a parameterized distribution. It extends the traditional least Herbrand model semantics in 
logic programming to distribution semantics , possible world semantics with a probability 
distribution which is unconditionally applicable to arbitrary logic programs including ones 
for HMMs, PCFGs and Bayesian networks. 

We also propose a new EM algorithm, the graphical EM algorithm, that runs for a 
class of parameterized logic programs representing sequential decision processes where each 
decision is exclusive and independent. It runs on a new data structure called support graphs 
describing the logical relationship between observations and their explanations, and learns 
parameters by computing inside and outside probability generalized for logic programs. 

The complexity analysis shows that when combined with OLDT search for all expla- 
nations for observations, the graphical EM algorithm, despite its generality, has the same 
time complexity as existing EM algorithms, i.e. the Baum- Welch algorithm for HMMs, the 
Inside-Outside algorithm for PCFGs, and the one for singly connected Bayesian networks 
that have been developed independently in each research field. Learning experiments with 
PCFGs using two corpora of moderate size indicate that the graphical EM algorithm can 
significantly outperform the Inside-Outside algorithm. 

1. Introduction 

Parameter learning is common in various fields from neural networks to reinforcement learn- 
ing to statistics. It is used to tune up systems for their best performance, be they classifiers 
or statistical models. Unlike these numerical systems described by mathematical formu- 
las however, symbolic systems, typically programs, do not seem amenable to any kind of 
parameter learning. Actually there has been little literature on parameter learning of pro- 
grams. 

This paper is an attempt to incorporate parameter learning into computer programs. 
The reason is twofold. Theoretically we wish to add the ability of learning to computer 
programs, which the authors believe is a necessary step toward building intelligent systems. 
Practically it broadens the class of probability distributions, beyond traditionally used nu- 
merical ones, which are available for modeling complex phenomena such as gene inheritance, 
consumer behavior, natural language processing and so on. 
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The type of learning we consider here is statistical parameter learning applied to logic 
programs. 1 We assume that facts (unit clauses) in a program are probabilistically true and 
have a parameterized distribution. 2 Other clauses, non-unit definite clauses, are always 
true as they encode laws such as "if one has a pair of blood type genes a and b, one's 
blood type is AB". We call logic programs of this type a parameterized logic program and 
use for statistical modeling in which ground atoms 3 provable from the program represent 
our observations such as "one's blood type is AB" and the parameters of the program are 
inferred by performing ML (maximum likelihood) estimation on the observed atoms. 

The probabilistic first-order framework sketched above is termed statistical abduction 
(Sato & Kameya, 2000) as it is an amalgamation of statistical inference and abduction 
where probabilistic facts play the role of abducibles, i.e. primitive hypotheses. 4 Statistical 
abduction is powerful in that it not only subsumes diverse symbolic-statistical frameworks 
such as HMMs (hidden Markov models, Rabiner, 1989), PCFGs (probabilistic context free 
grammars, Wetherell, 1980; Manning & Schiitze, 1999) and (discrete) Bayesian networks 
(Pearl, 1988; Castillo, Gutierrez, & Hadi, 1997) but gives us freedom of using arbitrarily 
complex logic programs for modeling. 5 

The semantic basis for statistical abduction is distribution semantics introduced by Sato 
(1995). It defines a parameterized distribution, actually a probability measure, over the set 
of possible truth assignments to ground atoms and enables us to derive a new EM algorithm 6 
for ML estimation called the graphical EM algorithm (Kameya & Sato, 2000). 

Parameter learning in statistical abduction is done in two phases, search and EM learn- 
ing. Given a parameterized logic program and observations, the first phase searches for all 
explanations for the observations. Redundancy in the first phase is eliminated by tabulating 
partial explanations using OLDT search (Tamaki & Sato, 1986; Warren, 1992; Sagonas, T., 
& Warren, 1994; Ramakrishnan, Rao, Sagonas, Swift, & Warren, 1995; Shen, Yuan, You, & 
Zhou, 2001). It returns a support graph which is a compact representation of the discovered 
explanations. In the second phase, we run the graphical EM algorithm on the support graph 



1. In this paper, logic programs mean definite clause programs. A definite clause program is a set of definite 
clauses. A definite clause is a clause of the form A <— Li, . . . , L„ (0 < n) where A, L\, . . . , L n are atoms. 
A is called the head, L\, . . . , L„ the body. All variables are universally quantified. It reads if L\ and 
• • • and L„ hold, then A holds. In case of n = 0, the clause is called a unit clause. A general clause is 
one whose body may contain negated atoms. A program including general clauses is sometimes called a 
general program (Lloyd, 1984; Doets, 1994). 

2. Throughout this paper, for familiarity and readability, we will somewhat loosely use "distribution" as a 
synonym for "probability measure" . 

3. In logic programming, the adjective "ground" means no variables contained. 

4. Abduction means inference to the best explanation for a set of observations. Logically, it is formalized as 
a search for an explanation E such that E, KB h G where G is an atom representing our observation, KB 
a knowledge base and E a conjunction of atoms chosen from abducibles, i.e. a class of formulas allowed 
as primitive hypotheses (Kakas, Kowalski, & Toni, 1992; Flach & Kakas, 2000). E must be consistent 
with KB. 

5. Existing symbolic-statistical modeling frameworks have restrictions and limitations of various types com- 
pared with arbitrary logic programs (see Section 7 for details). For example, Bayesian networks do not 
allow recursion. HMMs and PCFGs, stochastic grammars, allow recursion but lack variables and data 
structures. Recursive logic programs are allowed in Ngo and Haddawy's (1997) framework but they 
assume domains are finite and function symbols seem prohibited. 

6. "EM algorithm" stands for a class of iterative algorithms for ML estimation with incomplete data 
(McLachlan & Krishnan, 1997). 
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and learn the parameters of the distribution associated with the program. Redundancy in 
the second phase is removed by the introduction of inside and outside probability for logic 
programs computed from the support graph. 

The graphical EM algorithm has accomplished, when combined with OLDT search for 
all explanations, the same time complexity as the specialized ones, e.g. the Baum- Welch 
algorithm for HMMs (Rabiner, 1989) and the Inside-Outside algorithm for PCFGs (Baker, 
1979), despite its generality. What is surprising is that, when we conducted learning exper- 
iments with PCFGs using real corpora, it outperformed the Inside-Outside algorithm by 
orders of magnitudes in terms of time for one iteration to update parameters. These ex- 
perimental results enhance the prospect for symbolic-statistical modeling by parameterized 
logic programs of even more complex systems than stochastic grammars whose modeling 
has been difficult simply because of the lack of an appropriate modeling tool and their sheer 
complexities. The contributions of this paper therefore are 

• distribution semantics for parameterized logic programs which unifies existing symbolic- 
statistical frameworks, 

• the graphical EM algorithm (combined with tabulated search), a general yet efficient 
EM algorithm that runs on support graphs and 

• the prospect suggested by the learning experiments for modeling and learning complex 
symbolic-statistical phenomena. 

The rest of this paper is organized as follows. After preliminaries in Section 2, a proba- 
bility space for parameterized logic programs is constructed in Section 3 as a mathematical 
basis for the subsequent sections. We then propose a new EM algorithm, the graphical 
EM algorithm, for parameterized logic programs in Section 4. Complexity analysis of the 
graphical EM algorithm is presented in Section 5 for HMMs, PCFGs, pseudo PCSGs and 
sc-BNs. 7 Section 6 contains experimental results of parameter learning with PCFGs by the 
graphical EM algorithm using real corpora that demonstrate the efficiency of the graphical 
EM algorithm. We state related work in Section 7, followed by conclusion in Section 8. The 
reader is assumed to be familiar with the basics of logic programming (Lloyd, 1984; Doets, 
1994), probability theory (Chow & Teicher, 1997), Bayesian networks (Pearl, 1988; Castillo 
et al., 1997) and stochastic grammars (Rabiner, 1989; Manning & Schiitze, 1999). 

2. Preliminaries 

Since our subject intersects logic programming and EM learning which are quite different 
in nature, we separate preliminaries. 

2.1 Logic Programming and OLDT 

In logic programming, a program DB is a set of definite clauses 8 and the execution is search 
for an SLD refutation of a given goal <— G. The top-down interpreter recursively selects the 

7. Pseudo PCSGs (probabilistic context sensitive grammars) are a context-sensitive extension of PCFGs 
proposed by Charniak and Carroll (1994). sc-BN is a shorthand for a singly connected Bayesian network 
(Pearl, 1988). 

8. We do not deal with general logic programs in this paper. 
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next goal and unfolds it (Tamaki & Sato, 1984) into subgoals using a nondeterministically 
chosen clause. The computed result by the SLD refutation, i.e. a solution, is an answer 
substitution (variable binding) 9 such that DB h G9. 9 Usually there is more than one 
refutation for <— G, and the search space for all refutations is described by an SLD tree 
which may be infinite depending on the program and the goal (Lloyd, 1984; Doets, 1994). 

More often than not, applications require all solutions. In natural language processing 
for instance, a parser must be able to find all possible parse trees for a given sentence as 
every one of them is syntactically correct. Similarly in statistical abduction, we need to 
examine all explanations to determine the most likely one. All solutions are obtained by 
searching the entire SLD tree, and there is a choice of the search strategy. In Prolog, the 
standard logic programming language, backtracking is used to search for all solutions in 
conjunction with a fixed search order for goals (textually from left- to-right) and clauses 
(textually top-to-bottom) due to the ease and simplicity of implementation. 

The problem with backtracking is that it forgets everything until up to the previous 
choice point, and hence it is quite likely to prove the same goal again and again, resulting in 
exponential search time. One answer to avoid this problem is to store computed results and 
reuse them whenever necessary. OLDT is such an instance of memoizing scheme (Tamaki 
& Sato, 1986; Warren, 1992; Sagonas et al., 1994; Ramakrishnan et al., 1995; Shen et al., 
2001). Reuse of proved subgoals in OLDT search often drastically reduces search time 
for all solutions, especially when refutations of the top goal include many common sub- 
refutations. Take as an example a logic program coding an HMM. For a given string s, 
there exist exponentially many transition paths that output s. OLDT search applied to 
the program however only takes time linear in the length of s to find all of them unlike 
exponential time by Prolog's backtracking search. 

What does OLDT have to do with statistical abduction? From the viewpoint of statisti- 
cal abduction, reuse of proved subgoals, or equivalently, structure sharing of sub-refutations 
for the top-goal G brings about structure sharing of explanations for G, in addition to the 
reduction of search time mentioned above, thereby producing a highly compact representa- 
tion of all explanations for G. 

2.2 EM Learning 

Parameterized distributions such as the multinomial distribution and the normal distribu- 
tion provide convenient modeling devices in statistics. Suppose a random sample x\, . . . , xj 
of size T on a random variable X drawn from a distribution P(X = x \ 9) parameterized 
by unknown 9, is observed. The value of 9 is determined by ML estimation as the MLE 
(maximum likelihood estimate) of 9, i.e. as the maximizer of the likelihood rii<;<x P( x i I 

Things get much more difficult when data are incomplete. Think of a probabilistic 
relationship between non-observable cause X and observable effect Y such as one between 
diseases and symptoms in medicine and assume that Y does not uniquely determine the 
cause X . Then Y is incomplete in the sense that Y does not carry enough information to 
completely determine X. Let P(X = x, Y = y \ 9) be a parameterized joint distribution 
over X and Y . Our task is to perform ML estimation on 9 under the condition that X is 



9. By a solution we ambiguously mean both the answer substitution 9 itself and the proved atom G9, as 
one gives the other. 
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non-observable while Y is observable. Let yi, . . . , yj be a random sample of size T drawn 
from the marginal distribution P(Y = y \ 9) = J2 X P(X = x,Y = y \ 9). The MLE of 9 is 
obtained by maximizing the likelihood Y\i<i<T P{Vi I ^) as a function of 9. 

While mathematical formulation looks alike in both cases, the latter, ML estimation with 
incomplete data, is far more complicated and direct maximization is practically impossible 
in many cases. People therefore looked to indirect approaches to tackle the problem of 
ML estimation with incomplete data to which the EM algorithm has been a standard 
solution (Dempster, Laird, & Rubin, 1977; McLachlan & Krishnan, 1997). It is an iterative 
algorithm applicable to a wide class of parameterized distributions including the multinomial 
distribution and the normal distribution such that the MLE computation is replaced by the 
iteration of two easier, more tractable steps. At ra-th iteration, it first calculates the value 
of Q function introduced below using current parameter value 9^ (E-step) 10 : 

Q(9\9^) = Y J P(* \y,0^)lnP(x,y \9). (1) 

X 

Next, it maximizes Q(9 \ 9^) as a function of 9 and updates 9^ (M-step): 

9^ n+1 ) = argmax e Q(0 | 9^). (2) 

Since the old value 9^ and the updated value do not necessarily coincide, the E-steps 

and M-steps are iterated until convergence, during which the (log) likelihood is assured to 
increase monotonically (McLachlan & Krishnan, 1997). 

Although the EM algorithm merely performs local maximization, it is used in a variety 
of settings due to its simplicity and relatively good performance. One must notice however 
that the EM algorithm is just a class name, taking different form depending on distributions 
and applications. The development of a concrete EM algorithm such as the Baum- Welch 
algorithm for HMMs (Rabiner, 1989) and the Inside-Outside algorithm for PCFGs (Baker, 
1979) requires individual effort for each case. 



10. Q function is related to ML estimation as follows. We assume here only one data, y, is observed. From 
Jensen's inequality (Chow & Teicher, 1997) and the concavity of In function, it follows that 

J2 P ( X I y,d in) )\nP{x | y,6) - J2 P( - X I y>0 (n) )ln-P(z I y,d {n) ) < 

X X 

and hence that 
Q{6 | 6 {n) )-Q{6 {n) | e {n) ) 

= J2 P( - X I y,Q (n> )^P(x | y,8)-J2 P ( x I y,d {n) )\nP{x I y,e (n) ) + \nP(y \ 0) - In P(y | 

X X 

< \nP(y\e)-\nP(y\e^). 
Consequently, we have 

Q(0 1 {n) ) > Q(0 (n) 1 {n) ) iMv I o) > ^p(y I e(n) ) => p(y I *) > p(v I e(n) )- 
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3. Distribution Semantics 

In this section, we introduce parameterized logic programs and define their declarative se- 
mantics. The basic idea is as follows. We start with a set F of probabilistic facts (atoms) 
and a set R of non-unit definite clauses. Sampling from F determines a set F' of true 
atoms, and the least Herbrand model of F' U R determines the truth value of every atom 
in DB = F U R. Hence every atom can be considered as a random variable, taking on 1 
(true) or (false). In what follows, we formalize this process and construct the underlying 
probability space for the denotation of DB. 

3.1 Basic Distribution Pp 

Let DB = FUR be a definite clause program in a first-order language £ with countably many 
variables, function symbols and predicate symbols where F is a set of unit clauses (facts) 
and R a set of non-unit clauses (rules). In the sequel, unless otherwise stated, we consider 
for simplicity DB as the set of all ground instances of the clauses in DB, and assume that 
F and R consist of countably infinite ground clauses (the finite case is similarly treated). 
We then construct a probability space for DB in two steps. First we introduce a probability 
space over the Herbrand interpretations 11 of F i.e. the truth assignments to ground atoms 
in F. Next we extend it to a probability space over the Herbrand interpretations of all 
ground atoms in £ by using the least model semantics (Lloyd, 1984; Doets, 1994). 

Let Ai, A2, . . . be a fixed enumeration of atoms in F. We regard an infinite vector u = 
{x\, X2, . . .) of Os and Is as a Herbrand interpretation of F in such a way that for i = 1,2,... 
Ai is true (resp. false) if and only if X{ = 1 (resp. X{ = 0). Under this isomorphism, the set 
of all possible Herbrand interpretations of F coincides with the Cartesian product: 

00 

8 = 1 

We construct a probability measure Pp over the sample space ftp 12 from a collection of 
finite joint distributions Pp n ^(A\ = x\, . . . , A n = x n ) (n = 1, 2, . . . , X{ £ {0, 1}, 1 < i < n) 
such that 

< P ( F n) (A 1 = Xl ,...,A n = x n )<l 

£*!,...,*„ =x 1 ,...,A n = x n ) = l (3) 

(^T = x l 1 ■ ■ ■ 1 A n +1 = x n+l ) = Pp (^4-1 = x l i • • • i A n = X n ). 

The last equation is called the compatibility condition. It can be proved (Chow & Teicher, 
1997) from the compatibility condition that there exists a probability space (ftp, J 7 , Pp) 
where Pp is a probability measure on J 7 , the minimal a algebra containing open sets of ftp, 
such that for any ra, 

Pf(Ai = X\ , . . . , A n = x n ) = Pp \ Ai = X\ , . . . , A n = x n ). 

11. A Herbrand interpretation interprets a function symbol uniquely as a function on ground terms and 
assigns truth values to ground atoms. Since the interpretation of function symbols is common to all 
Herbrand interpretations, given C, they have a one-to-one correspondence with truth assignments to 
ground atoms in C. So we do not distinguish them. 

12. We regard Q_p as a topological space with the product topology such that each {0, 1} is equipped with 
the discrete topology. 



396 



Parameter Learning of Logic Programs for Symbolic-statistical Modeling 




We call Pp a basic distribution. 13 

(n) 

The choice of P F ' is free as long as the compatibility condition is met. If we want all 
interpretations to be equiprobable, we should set Pp n \A\ = x\, . . . ,A n = x n ) = l/2 n for 
every (x\, . . . , x n ). The resulting Pp is a uniform distribution over ftp just like the one 
over the unit interval [0,1]. If, on the other hand, we stipulate no interpretation except 
ujq = (ci,C2, . . .) should be possible, we put, for each n, 

( n ) _ _ J 1 if Mi Xi = Ci (1 < i < n) 

rp \ A\ — X\ , . . . , A n — x n j - 

Then Pp places all probability mass on ujo and gives probability to the rest. 

Define a parameterized logic program as a definite clause program 14 DB = F U R where 
F is a set of unit clauses, R is a set of non-unit clauses such that no clause head in R is 
unifiable with a unit clause in F and a parameterized basic distribution Pp is associated with 
F. A parameterized Pp is obtained from a collection of parameterized joint distributions 

(n) 

satisfying the compatibility condition. Generally, the more complex P F n s are, the more 
flexible Pp is, but at the cost of tractability. The choice of parameterized finite distributions 
made by Sato (1995) was simple: 

Pg n \ON 1 =X 1 ,OFF 2 = X2,...,OFF2n = X2n\0 1 ,...,6 n ) 

n 

= II Pbs(ON 2i -i = x 2t -i,OFF 2t = x 2i | 9i) 

8 = 1 

where 

P bs (ON 2t -i = x 2t -uOFF 2t = x 2l | 0i) 
iix 2l -i = x 2 i 

9, \ix2 % -\ = 1, x 2l = (4) 
1 - 0i iix 2l -i = 0, x 2l = 1. 

Pbs{ON 2i-i = X2i-i,OFF2i = X2i I 0i) (1 < i < n) represents a probabilistic binary switch, 
i.e. a Bernoulli trial, using two exclusive atoms ONzi-i and OFF2i in such a way that either 
one of them is true on each trial but never both. 6i is a parameter specifying the probability 
that the switch i is on. The resulting Pp is a probability measure over the infinite product of 
independent binary outcomes. It might look too simple but expressive enough for Bayesian 
networks, Markov chains and HMMs (Sato, 1995; Sato & Kameya, 1997). 

3.2 Extending Pp to P DB 

In this subsection, we extend Pp to a probability measure Pp,B over the possible worlds 
for £, i.e. the set of all possible truth assignments to ground atoms in £ through the least 



13. This naming of Pf, despite its being a probability measure, partly reflects the observation that it behaves 
like an infinite joint distribution Pf(A\ = x\,Ai = £2,...) for an infinite random vector (Ai,A2,...) 
of which Pp*\Ai = x\,...,A n = x„) (n = 1,2,...) are marginal distributions. Another reason is 
intuitiveness. These considerations apply to Pdb defined in the next subsection as well. 

14. Here clauses are not necessarily ground. 
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Herbrand model (Lloyd, 1984; Doets, 1994). Before proceeding however, we need a couple 
of notations. For an atom A, define A x by 

A x = A if x = 1 
A x = -i A if x = 0. 

Next take a Herbrand interpretation v £ ftp of F. It makes some atoms in F true and 
others false. Let F v be the set of atoms made true by v. Then imagine a definite clause 
program DB' = R U F v and its least Herbrand model M DB , (Lloyd, 1984; Doets, 1994). 
M DB t is characterized as the least fixed point of a mapping F DB t(-) below 

there is some A <- . . . , B k £ DB' (0 < k) 1 
such that {Bt,...,B k } C / J 

where / is a set of ground atoms. 15 Or equivalently, it is inductively defined by 

io = 

In+l = T DB i(I n ) 

M DB > = \Jl n . 

n 

Taking into account the fact that M DB t is a function of v £ we henceforth employ a 
functional notation Me)b{v) to denote M DB *. 

Turning back, let A\, A2, ... be again an enumeration, but of all ground atoms in £. 16 
Form Hdb-i similarly to ftp, as the Cartesian product of denumerably many {0, l}'s and iden- 
tify it with the set of all possible Herbrand interpretations of the ground atoms Ai, A2, . . . 
in £, i.e. the possible worlds for £. Then extend Pp to a probability measure Pp,B over Hdb 
as follows. Introduce a series of finite joint distributions P]j B (Ai = x\, . . . , A n = x n ) for 
n = 1,2,. . . by 

[A Xl A • • • A A Xn ]p = {v £ n F \ M DB (v) \= A X1 A ■ ■ ■ A A x "} 
P i D B ) (A 1 = x 1 ,...,A n = x n ) = P F ([A X1 A---AA X "] F ). 

Note that the set [A Xl A • • • A A x n n ]p is Pp-measurable and by definition, P^g's satisfy the 
compatibility condition 

^ ] P]JB ^ (^1 = ^1 ' • • • ' A-n+Y = x n+l ) = -^Di] (^1 = ^1 , • • • , A n = X n ). 

Hence there exists a probability measure Pbb over Hdb which is an extension of Pp such 
that 

Pdb{M = x 1 ,...,A n = x n ) = Pf(Ai = x 1 ,...,A n = x n ) 



15. I defines, mutually, a Herbrand interpretation such that a ground atom A is true if and only if A G I. 
A Herbrand model of a program is a Herbrand interpretation that makes every ground instance of every 
clause in the program true. 

16. Note that this enumeration enumerates ground atoms in F as well. 



T DB >(I) = A 
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for any finite atoms A\, . . . , A n in F and for every binary vector (x\, . . . , x n ) (xi £ {0, f }, f < 
i < n). Define the denotation of the program DB = F U R w.r.t. Pp to be Pdb- The de- 
notational semantics of parameterized logic programs defined above is called distribution 
semantics. As remarked before, we regard Pdb as a kind of infinite joint distribution 
Pdb(Ai = x\,A2 = X2,...). Mathematical properties of Pdb are listed in Appendix A 
where our semantics is proved to be an extension of the standard least model semantics in 
logic programming to possible world semantics with a probability measure. 

3.3 Programs as Distributions 

Distribution semantics views parameterized logic programs as expressing distributions. Tra- 
ditionally distributions have been expressed by using mathematical formulas but the use of 
programs as (discrete) distributions gives us far more freedom and flexibility than mathe- 
matical formulas in the construction of distributions because they have recursion and arbi- 
trary composition. In particular a program can contain infinitely many random variables 
as probabilistic atoms through recursion, and hence can describe stochastic processes that 
potentially involve infinitely many random variables such as Markov chains and derivations 
in PCFGs (Manning k Schiitze, 1999). 17 

Programs also enable us to procedurally express complicated constraints on distributions 
such as "the sum of occurrences of alphabets a or b in an output string of an HMM must be 
a multiple of three". This feature, procedural expression of arbitrarily complex (discrete) 
distributions, seems quite helpful in symbolic-statistical modeling. 

Finally, providing mathematically sound semantics for parameterized logic programs 
is one thing, and implementing distribution semantics in a tractable way is another. In 
the next section, we investigate conditions on parameterized logic programs which make 
probability computation tractable, thereby making them usable as a means for large scale 
symbolic-statistical modeling. 

4. Graphical EM Algorithm 

According to the preceding section, a parameterized logic program DB = F U R in a 
first-order language £ with a parameterized basic distribution Pf(- 1 0) over the Herbrand 
interpretations of ground atoms in F specifies a parameterized distribution Pdb{; \ 0) over 
the Herbrand interpretations for £. In this section, we develop, step by step, an efficient EM 
algorithm for the parameter learning of parameterized logic programs by interpreting Pdb 
as a distribution over the observable and non- observable events. The new EM algorithm is 
termed the graphical EM algorithm. It is applicable to arbitrary logic programs satisfying 
certain conditions described later provided the basic distribution is a direct product of 
multi-ary random switches, which is a slight complication of the binary ones introduced in 
Section 3.1. 

From this section on, we assume that DB consists of usual definite clauses containing 
(universally quantified) variables. Definitions and changes relating to this assumption are 

17. An infinite derivation can occur in PCFGs. Take a simple PCFG {p : S — > a, q : S — > SS} where S is a 
start symbol, a a terminal symbol, p + q = 1 and p, q > 0. In this PCFG, S is rewritten either to a with 
probability p or to SS with probability q. The probability of the occurrence of an infinite derivation is 
calculated as max {0, 1 — (p/q)} which is non-zero when q > p (Chi & Geman, 1998). 
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listed below. For a predicate p, we introduce iff (p), the iff definition of p by 

iff(p) d = Vx (p(x)~ 3y 1 (x = t 1 Alfi)V'-V 3y n (x = t n A W„)) . 

Here a; is a vector of new variables of length equal to the arity of p, p(/;) <— Wi (1 < i < 
n, < n), an enumeration of clauses about p in _D_B, and a vector of variables occurring 
in p(ti) <— W 4 -. Then define comp(R) as follows. 

{_B | i? is a ground instance of a clause head appearing in R} 

{iff (p) | p appears in a clause head in R} 

{f(x) = f(y) — ► x = y | / is a function symbol} 

U {f(x) / </(y) | / and g are different function symbols} 
U {/ / x | / is a term properly containing a;} 

iff(i?) U £ g 

_E g , Clark's equational theory (Clark, 1978), deductively simulates unification. Likewise 
comp(R) is a first-order theory which deductively simulates SLD refutation with the help 
of E q by replacing a clause head atom with the clause body (Lloyd, 1984; Doets, 1994). 

We here introduce some definitions which will be frequently used. Let B be an atom. 
An explanation for B w.r.t. DB = F U R is a conjunction S such that S,R h B, and as a 
set comprised of its conjuncts, S C F holds and no proper subset of S satisfies this. The 
set of all explanations for B is called the support set for B and designated by ^^(i?). 18 

4.1 Motivating Example 

First of all, we review distribution semantics by a concrete example. Consider the following 
program DBf, = F\, U Rf, in Figure 1 modeling how one's blood type is determined by blood 
type genes probabilistically inherited from the parents. 19 

The first four clauses in Rf, state a blood type is determined by a genotype, i.e. a pair of 
blood type genes a, b and o. For instance, btype('A') :- (gtype(a,a) ; gtype(a,o) ; 
gtype(o,a)) says that one's blood type is A if his (her) genotype is (a, a), (a, o) or (o,a). 
These are propositional rules. 

Succeeding clauses state general rules in terms of logical variables. The fifth clause 
says that regardless of the values of X and Y, event gtype(X,Y) (one's having genotype 
(X,Y)) is caused by two events, gene(f ather ,X) (inheriting gene X from the father) and 
gene (mother ,Y) (inheriting gene Y from the mother). gene(P,G):- msw(gene,P,G) is a 
clause connecting rules in Rf, with probabilistic facts in Ff,. It tells us that the gene G 
is inherited from a parent P if a choice represented by msw(gene,P,G) 20 is made. The 

18. This definition of a support set differs from the one used by Sato (1995) and Kameya and Sato (2000). 

19. When we implicitly emphasize the procedural reading of logic programs, Prolog conventions are employed 
(Sterling & Shapiro, 1986). Thus, ; stands for "or", , "and" :- "implied by" respectively. Strings 
beginning with a capital letter are (universally quantified) variables, but quoted ones such as 'A' are 
constants. The underscore _ is an anonymous variable. 

20. msw is an abbreviation of "multi-ary random switch" and msw(-, •, •) expresses a probabilistic choice from 
finite alternatives. In the framework of statistical abduction, msw atoms are abducibles from which 
explanations are constructed as a conjunction. 



head(R) = 

F q - 



comp(R) = f 
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Rh 



btype( ' A' ) 
btype('B') 
btype('O') 
btype('AB') 
gtype(X,Y) 
gene(P,G) 



(gtype(a,a) ; gtype(a,o) ; gtype(o,a)) 
(gtype(b,b) ; gtype(b,o) ; gtype(o,b)) 
gtype(o,o) . 

(gtype(a,b) ; gtype(b,a)). 
gene(f ather ,X) , gene (mother, Y) . 
msw(gene,P,G) . 



F b = {msw(gene,f ather, a), msw(gene, father, b),msw(gene, father, o), 

msw (gene .mother , a) , msw (gene .mother , b) , msw (gene .mother , o) } 



Figure 1: ABO blood type program DBf, 

genetic knowledge that the choice of G is by chance and made from {a,b, o} is expressed by 
specifying a joint distribution F b as follows. 

Pp b (msw(gene,^,a) = s,msw(gene , t,b) = msw (gene, £,o) = z\9 a ,9b,9 ) A = 9 T a 9^9 z 

where x,y,z G {0, 1}, x + y + z = 1, a , 0b, O G [0,1], 9 a + 9b + O = 1 and t is either 
father or mother. Thus 9 a is the probability of inheriting gene a from a parent. Statistical 
independence of the choice of gene, once from father and once from mother, is expressed 
by putting 

Pp b ( msw (gene , father, a) = s,msw(gene ,f ather, b) = y, msw (gene, f ather, o) = z, 

msw (gene .mother , a) = x' , msw(gene, mother, b) = y', msw (gene, mother, o) = z' 

I 9 a , 9b, 9 ) 

= P Fb (x,y,z | a ,0 b ,0 o )P Fb (x',y',z' \ a ,0 b ,0 o ). 

In this setting, atoms representing our observation are obs(DB b ) = {btype( ' A' ) , btype( 'B ' ) , 
btype( ' ' ) , btype( ' AB ' ) }. We observe one of them, say btype( ' A' ) , and infer a possible 
explanation 5, i.e. a minimal conjunction of abducibles msw(gene , • , •) such that 

S,R b h btype('A'). 

S is obtained by applying a special SLD refutation procedure to the goal <— btype('A') 
which preserves msw atoms resolved upon in the refutation. Three explanations are found. 

51 = msw(gene,f ather, a) A msw (gene, mother, a) 

52 = msw(gene,f ather, a) A msw(gene, mother, o) 

53 = msw(gene,f ather, o) A msw(gene, mother, a) 

So t/>£)£ b (btype(a)), the support set for btype(a), is {Si, S2, S3}. The probability of each 
explanation is respectively computed as P Fb (Si) = 0\ and Pp^S^) = Pp^S^) = a o - From 
Proposition A. 2 in Appendix A, it follows that P£>£ b (btype( ' A' )) = PDB b (Si V S2 V S3) = 
P Fb (Si V S 2 V S 3 ) and that 

P DBb (btype('A') I a ,0 b ,0o) = PF b (Si) + PF b (S 2 ) + Pp b (S 3 ) 

= 2 a -\- 20 a o . 
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Here we used the fact that Si, S2 and S3 are mutually exclusive as the choice of gene is 
exclusive. Parameters, i.e. a , Of, and 9 are determined by ML estimation performed on a 
random sample such as {btype( ' A' ) , btype( ' ' ) ,btype( ' AB ' )} of btype as follows. 

) = argmax (eaAA) P DBfc (btype ( ' A' ))P DBfc (btype ( ' ' ))P DBfc (btype ( ' AB ' )) 

= argmax (eaAA) (9 2 a + 29 a 9 )9 2 9 a 9 b 

This program contains neither function symbol nor recursion though our semantics 
allows for them. Later we see an example containing both, a program for an HMM (Rabiner 
& Juang, 1993). 

4.2 Four Simplifying Conditions 

DBf, in Figure 1 is simple and probability computation is easy. This is not generally the 
case. Since our primary interest is learning, especially efficient parameter learning of param- 
eterized logic programs, we hereafter concentrate on identifying what property of a program 
makes probability computation easy like DBf,, thereby makes efficient parameter learning 
possible. 

To answer this question precisely, let us formulate the whole modeling process. Suppose 
there exist symbolic-statistical phenomena such as gene inheritance for which we hope 
to construct a probabilistic computational model. We first specify a target predicate p 
whose ground atom p(s) represents our observation of the phenomena. Then to explain 
the empirical distribution of p, we write down a parameterized logic program DB = F U R 
having a basic distribution Pp with parameter 9 that can reproduce all observable patterns 
of p(s). Finally, observing a random sample p(s\), . . . ,p(sx) of ground atoms of p, we 
adjust 9 by ML estimation, i.e. by maximizing the likelihood L(9) = YlJ=i PDB(p( s t) \ 0) so 
that Pdb{p{ ) I ^) approximates as closely to the empirically observed distribution of p as 
possible. 

At first sight, this formulation looks right, but in reality it is not. Suppose two events 
p(s) and p(s') (s / s') are observed. We put L(9) = Pdb(p( s ) I 8)Pdb(p( s ') I But this 
cannot be a likelihood at all simply because in distribution semantics, p(s) and p(s') are 
two different random variables, not two realizations of the same random variable. 

A quick remedy is to note that in the case of blood type program DBf, where obs(DBi>) = 
{btype ( ' A' ) , btype ( 'B ' ) , btype ( ' ' ) , btype ( ' AB ' ) } are observable atoms, only one of 
them is true for each observation, and if some atom is true, others must be false. In other 
words, these atoms collectively behave as a single random variable having the distribution 
PDB b whose values are obs(DBi>). 

Keeping this in mind, we introduce the following condition. Let obs(DB) (c head(R)) 
be a set of ground atoms which represent observable events. We call them observable atoms. 

Uniqueness condition: 

P DB (G A G 1 ) = for any G / G' G obs(DB), and E G eobs(DB) Pdb(G) = 1. 
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The uniqueness condition enables us to introduce a new random variable Y representing 
our observation. Fix an enumeration Gi, G2, . . . of observable atoms in obs(DB) and define 
Y by 21 

Y (u) = k iff u |= G k for u G tt DB (k > 1). (5) 

Let Gfcj , Gfc 2 , . . . , Gj~ T G obs(DB) be a random sample of size T. Then L(0) = Ylt=i PDB(Gk t I 
^) = Tit=i Pdb(Y = kf \ 0) qualifies for the likelihood function w.r.t. Y . 

The second condition concerns the reduction of probability computation to addition. 
Take again the blood type exmaple. The computation of Po£ b (btype( ' A' )) is decomposed 
into a summation because explanations in the support set are mutualy exclusive. So we 
introduce 

Exclusiveness condition: 

For every G G obs(DB) and the support set iPdb(G), Pdb(S A S') = for any S / 
S' G 4> DB (G). 

Using the exclusiveness condition (and Proposition A. 2 in Appendix A), we have 

Pdb(G) = MS)- 

From a modeling point of view, it means that while a single event, or a single observation, 
G, may have several (or even infinite) explanations iPdb(G), only one of iPdb(G) is allowed 
to be true for each observation. 

Now introduce $_d#, i.e. the set of all explanations relevant to obs(DB) by 

^DB = U ^DB(G) 
Geobs(DB) 

and fix an enumeration Si , S2 , . . . of explanations in ^db- It follows from Proposition A. 2, 
the uniqueness condition and the exclusiveness condition that 

Pdb{Si A Sj) = for i / j and 

J2 Pdb(S) = E E p DB(S) 
se^DB Geobs(DB) se4> DB (G) 

= £ Pdb(G) 

Geobs(DB) 
= 1. 

So we are able to introduce under the uniqueness condition and the exclusiveness condition 
yet another random variable X e , representing an explanation for G, defined by 

X e (uj) = k iff lo |= Sk for uj G £Idb- (6) 

The third condition concerns termination. 



21. XyGgobs(DB) Pdb(G) = 1 only guarantees that the measure of { lo \ lo |= Gk for some h (> 1)} is one, so 
there can be some lo satisfying no GVs. In such case, we put Y (lo) = 0. But values on a set of measure 
zero do not affect any part of the discussion that follows. This also applies to the definition of X e in (6). 
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Finite support condition: 

For every G £ obs(DB) iPdb(G) is finite. 

Pdb(G) is then computed from the support set iPdb(G) = {S\, . . . , S m } (0 < to), with 
the help of the exclusiveness condition, as a finite summation YliLi PF(Si)- This condition 
prevents an infinite summation that is hardly computable. 

The fourth condition simplifies the probability computation to multiplication. Recall 
that an explanation S for G £ obs(DB) is a conjunction a\ A • • • A a m of some abducibles 
{ai, . . . , a m } C F (1 < to). In order to reduce the computation of Pp(S) = Pp( a i A • • • Aa m ) 
to the multiplication Pj?(ai) • • • Pp(a m ), we assume 

Distribution condition: 

P is a set P msH of ground atoms with a parameterized distribution P msH specified below. 

Here atom msw(i,n,f ) is intended to simulate a multi-ary random switch whose name is i 
and whose outcome is v on trial n. It is a generalization of primitive probabilistic events 
such as coin tossing and dice rolling. 

1- -PmsH consists of probabilistic atoms msw(i,n,f). The arguments i, n and v are ground 
terms called switch name, trial-id and a value (of the switch i), respectively. We 
assume that a finite set V{ of ground terms called the value set of i is associated with 
each i, and v £ V{ holds. 

2. Write V{ as {fi, f2, . . . , v m } (to = \Vi\). Then, one of the ground atoms { msw(i ,n,v\) , 
msw (i , n , v 2 ) , . . . , msw (i ,n,v m )} becomes exclusively true (takes on value 1) on each 
trial. With each i, a parameter 0i <v £ [0,1] such that J2 v eV t ®i,v = 1 is associated. 0i <v 
is the probability of msw(i,-,f) being true (v £ V{). 

3. For each ground terms i, i' , n, n', v £ Vi and v' £ VJ/, random variable msw(i,ra,f ) is 
independent of msw(i',n',t/) if ra / ra' or i / i'. 

In other words, we introduce a family of parameterized finite distributions P^ >n ) such that 

P( 8jn )(msw(i,ra,-ui) = zi, . . . ,msw(i,ra,w m ) = s m | hVl , . . . , 6» w ) 

I o.w. 1 ; 

where to = \Vi\, Xk £ {0, 1} (1 < k < to), and define P msH as their infinite product 

p ¥ TT p 
r mss — 11 • 

Under this condition, we can compute P mS y(S), the probability of an explanation 5, as the 
product of parameters. Suppose msw(ij,ra,f) and msw(ij/ ,n' ,v') are different conjuncts in 
an explanation S = msw(ii ,ri\ ,v\) A • • • A msw(i^ ,v^) . If either j / j' or n / n' holds, 
they are independent by construction. Else if j = j' and n = n' but v / v', they are not 
independent but P mS y(S) = by construction. As a result, whichever condition may hold, 
Pmsw(S) is computed from the parameters. 
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4.3 Modeling Principle 

Up to this point, we have introduced four conditions, the uniqueness condition, the exclu- 
siveness condition, the finite support condition and the distribution condition, to simplify 
probability computation. The last one is easy to satisfy. We just adopt F msw together with 
-Pmsw So, from here on, we always assume that _F msH has a parameterized distribution P msH 
introduced in the previous subsection. Unfortunately the rest are not satisfied automati- 
cally. According to our modeling experiences however, it is only mildly difficult to satisfy 
the uniqueness condition and the exclusiveness condition as long as we obey the following 
modeling principle. 

Modeling principle: DB = F mS yUR describes a sequential decision process 
(modulo auxiliary computations) that uniquely produces an observable atom 
G G obs(DB) where each decision is expressed by some msw atom. 22 

Translated into programming level, it says that we must take care when writing a pro- 
gram so that for any sample F' from P msH ; there must uniquely exist goal <— G (G G obs(DB)) 
which has a successful refutation from DB 1 = F' U R. We can confirm the principle by the 
blood type program DB b = F b U R b . It describes a process of gene inheritance, and for 
an arbitrary sample F b ' from P msy , say F b ' = {msw(gene, father, a), msw(gene, mother, o)}, 
there exists a unique goal, ^btype('A') in this case, that has a successful SLD refutation 
from F b ' U R b . 

The idea behind this principle is that a decision process always produces some result (an 
observable atom), and different decision processes must differ at some msw thereby entailing 
mutually exclusive observable atoms. So the uniqueness condition and the exclusiveness 
condition will be automatically satisfied. 

Satisfying the finite support condition is more difficult as it is virtually equivalent to 
writing a program DB for which all solution search for (G £ obs(DB)) always termi- 
nates. Apparently we have no general solution to this problem, but as far as specific models 
such as HMMs, PCFGs and Bayesian networks are concerned, it can be met. All programs 
for these models satisfy the finite support condition (and other conditions as well). 

4.4 Four Conditions Revisited 

In this subsection, we discuss how to relax the four simplifying conditions introduced in Sub- 
section 4.2 for the purpose of flexible modeling. We first examine the uniqueness condition 
considering its crucial role in the adaptation of the EM algorithm to our semantics. 

The uniqueness condition guarantees that there exists a (many-to-one) mapping from 
explanations to observations so that the EM algorithm is applicable (Dempster et al., 1977). 
It is possible, however, to relax the uniqueness condition while justifying the application 
of the EM algorithm. We assume the MAR (missing at random) condition introduced by 
Rubin (1976) which is a statistical condition on how a complete data (explanation) be- 
comes an incomplete data (observation), and is customarily assumed implicitly or explicitly 
in statistics (see Appendix B). By assuming the MAR condition, we can apply our EM 



22. Decisions made in the process are a finite subset of i*i 
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algorithm to non-exclusive observations such that J2o P(P) — 1 where the uniqueness 
condition is seemingly destroyed. 

Let us see the MAR condition in action with a simple example. Imagine we walk along 
a road in front of a lawn. We occasionally observe their state such as "the road is dry but 
the lawn is wet". Assume that the lawn is watered by a sprinkler running probabilistically. 
The program DB X ± = R x ± U F x ± in Figure 2 describes a sequential process which outputs 
an observation observed(road(a;) ,lawn(y) ) ("the road is x and the lawn is y v ) where 
x, y £ {wet, dry}. 

R Tl = { observed(road(X) ,lawn(Y) ) : - 
msw (rain , once , A) , 
( A = yes, X = wet, Y = wet 
; A = no, msw(sprinkler,once,B) , 
( B = on, X = dry, Y = wet 
; B = off , X = dry, Y = dry ) ) . } 
F Tl = { msw(rain, once, yes) , msw(rain,once,no) , 

msw (sprinkler, once, on) , msw (sprinkler , once, off ) } 

Figure 2: DB Tl 

The basic distribution over F Tl is specified like PF b (/) in Subsection 4.1, so we omit it. 
msw (rain, once, A) in the program determines whether it rains (A = yes) or not (A = no), 
whereas msw(sprinkler,once,B) determines whether the sprinkler works fine (B = on) 
or not (B = off). Since for each sampled values of A = a (a £ {yes, no}) and B = b 
(b £ {on, off}), there uniquely exists an observation observed(road(a;) , lawn(y)) (x,y £ 
{wet, dry}), there is a many-to-one mapping x '■ x( a >b) = (x,y). In other words, we 
can apply the EM algorithm to the observations observed(road(a;) ,lawn(y)) (x,y £ 
{wet, dry}). What would happen if we observe exclusively either a state of the road or 
that of the lawn? Logically, this means we observe By observed(road(a;) ,lawn(y) ) or 
3x observed(road(a;) , lawn(y)). Apparently the uniqueness condition is not met, because 
By observed(road(wet) , lawn(y)) and 3x observed(road(a;) ,lawn(wet)) are compatible 
(they are true when it rains). Despite the non-exclusiveness of the observations, we can still 
apply the EM algorithm to them under the MAR condition, which in this case translates 
into that we observe either the lawn or the road randomly regardless of their state. 

We now briefly check other conditions. Basically they can be relaxed at the cost of 
increased computation. Without the exclusiveness condition for instance, we would need an 
additional process of transforming the support set iPdb(G) for a goal G into a set of exclusive 
explanations. For instance, if G has explanations {msw(a,n,f), msw(6,m,w)}, we have to 
transform it into {msw(a,ra,f), -imsw(a,ra,f) Amsw(6,m,w)} and so on. 23 Clearly, this 
transformation is exponential in the number of msw atoms and efficiency concern leads to 
assuming the exclusiveness condition. 

The finite support condition is in practice equivalent to the condition that the SLD tree 
for <— G is finite. So relaxing this condition might induce infinite computation. 



23. -imsw(a , n , v) is further transformed to a disjunction of exclusive msw atoms like \J v ,^ v v i ev msw (a ,n ,v') . 
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Relaxing the distribution condition and accepting probability distributions other than 
-PmsH serve to expand the horizon of the applicability of parameterized logic programs. In 
particular the introduction of parameterized joint distributions P(v\, . . . ,Vk) like Boltz- 
mann distributions over switches mswi, . . . ,mswfc where v\, . . . , Vk are values of the switches, 
makes them correlated. Such distributions facilitate writing parameterized logic programs 
for complicated decision processes in which decisions are not independent but interdepen- 
dent. Obviously, on the other hand, they increase learning time, and whether the added 
flexibility of distributions deserves the increased learning time or not is yet to be seen. 

4.5 Naive Approach to EM Learning 

In this subsection, we derive a concrete EM algorithm for parameterized logic programs 
DB = F msy U R assuming that they satisfy the uniqueness condition, the exclusiveness 
condition and the finite support condition. 

To start, we introduce Y , a random variable representing our observations according 
to (5) based on a fixed enumeration of observable atoms in obs(DB). We also introduce 
another random variable X e representing their explanations according to (6) based on some 
fixed enumeration of explanations in ^db- Our understanding is that X e is non-observable 
while Y is observable, and they have a joint distribution P£©(X e = x,Y = y \ 9) where 
9 denotes relevant parameters. It is then immediate, following (1) and (2) in Section 2, to 

derive a concrete EM algorithm from the Q function defined by Q(9 \ 9') = f J2 X Pdb( x \ 
y, 9') In Pdb{ x iV \ 9) whose input is a random sample of observable atoms and whose output 
is the MLE of 9. 

In the following, for the sake of readability, we substitute an observable atom G (G £ 
obs(DB)) for Y = y and write Pdb(G \ 9) instead of Pdb(Yq = V \ 9). Likewise we 
substitute an explanation 5 (5 £ ^db) f° r X e = x and write Pe>b(S,G \ 9) instead of 
PDB{X e = x,Y = y | 9). Then it follows from the uniqueness condition that 



We need yet another notation here. For an explanation 5, define the count o/msw(i,ra,f ) 
in S by 

<Ji )V (S) ^ |{ n | msw(i,ra,f ) £ 5}| . 

We have done all preparations now. Suppose we make some observations Q = G\, . . . , Gj 
where G t £ obs(DB) (1 < t < T). Put 



/ is a set of switch names that appear in some explanation for one of the Gt's and denotes 
parameters associated with these switches. is finite due to the finite support condition. 




S $ ^db(G) 
S £ iPdb(G). 



I = {i | msTn(i ,n ,v ) £ 5 £ ipDB(Gt), 1 < t < T} 
d = {9i, v | msw(i,ra,w) £ S £ ipDB(Gt), 1 < t < T}. 
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Various probabilities and the Q function are computed by using Proposition A. 2 in 
Appendix A together with our assumptions as follows. 

P DB (G t \e) = Pdb(\J 'l>DB{G t )\o) = J2 P™( s \o) (8) 

p ms ,(s\e) = n e^ {s) 

iei,veVi 

X 

Q(9\9') d ^ f J2 E Pdb(S \G u O')lnP DB (S,G t \0) 

t=i se^DB 



iei,veVi iei,veVi 



where 



r,(i,v,0) d = f f 1 P™(S\°)°i,v(S) 

t=1 P DB (G t | 9) 5£WGt) 

Here we used Jensen's inequality to obtain (9). Note that PoBiGt | 9)~ x J2se^DB(G t ) 
Pmsu(S I d) a i,v{S) is the expected count of msw(i , • ,v ) in an SLD refutation of G t . Speaking 
of the likelihood function L(9) = YlJ=i PDB(Gt \ 9), it is already shown in Subsection 2.2 
(footnote) that Q(9 \ 9') > Q(9' \ 9') implies L{9) > L{9'). Hence from (9), we reach 
the procedure learn-naive(DB ,Q ') below that finds the MLE of the parameters. The array 
variable stores r](i,v,9) under the current 9. 

1: procedure learn-naive(DB , Q) begin 

2: Initialize 9 with appropriate values and e with a small positive number ; 
3: A(°) := Ya=i lnP DB (G t | 9); % Compute the log-likelihood. 

4: repeat 

5: foreach i £ /, v £ V t do 

T 1 

7: foreach i £ I, v £ V; do 



„.,.„ . ; % Update the parameters. 

to := to + 1; 

A (m) . = ^ 1 lnP DB (G t | 0) % Compute the log-likelihood again, 

until A^" 1 ) — \(' m ~ 1 ') < e % Terminate if converged, 
end 



This EM algorithm is simple and correctly calculates the MLE of 9, but the calcula- 
tion of PoBiGt | 9) and 7y[i, v](Line 3, 6 and 10) may suffer a combinatorial explosion of 
explanations. That is, \ipDB(Gt)\ often grows exponentially in the complexity of the model. 
For instance, \ipDB(Gt)\ for an HMM with N states is 0(N L ), exponential in the length L 
of an input /output string. Nonetheless, suppressing the explosion to realize efficient com- 
putation in a polynomial order is possible, under suitable conditions, by avoiding multiple 
computations of the same subgoal as we see next. 
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4.6 Inside Probability and Outside Probability for Logic Programs 

In this subsection, we generalize the notion of inside probability and outside probability 
(Baker, 1979; Lari & Young, 1990) to logic programs. Major computations in learn-naive(DB ,Q ) 
are those of two terms in Line 6, PoBiGt \ 0) and J2seipDB{Gt) ^sw(S \ 0)(Ji }V (S). Computa- 
tional redundancy lurks in the naive computation of both terms. We show it by an example. 
Suppose there is a propositional program DB P = F p U R p where F p = {a,b, c, d,m} and 

f <- aAg 
f ^b Ag 

g +- c (10) 
g^dAh 
h <— m. 



R p 



Here f is an observable atom. We assume that a, b, c, d and m are independent and also 
that {a,b} and {c,d} are pair- wise exclusive. Then the support set for f is calculated as 

?pDB p (f) ={aAc, aAdAm, bAc, bAdAm}. 

Hence, in light of (8), we may compute Pdb p (^) as 

P D B p (f) = PF p (aAc) + P Fp (aAdAm) + P Fp (bAc) + P Fp (bAdAm). (11) 

This computation requires 6 multiplications (because Pp (a A c) = Pp (a)Pp (c) etc.) and 
3 additions. On the other hand, it is possible to compute Pdb (f ) much more efficiently by 
factoring out common computations. Let A be a ground atom. Define the inside probability 
(5(A) of A as 

(3(A) d ^ P DB (A | 0). 24 (12) 
Then by applying Theorem A.l in Appendix A to 

comp(R p ) h f ^ (a A g) V (b A g), g^cV(dAh), h ^ m (13) 

which unconditionally holds in our semantics, and by using the independent and the ex- 
clusiveness assumption made on F p , the following equations about inside probability are 
derived. 

' /3(f) = /3(a)/3(g) + /3(b)/3(g) 
/3(g) = /3(c) + /3(d)/3(h) (14) 
/3(h) = /3(m) 

Pdb (f )(= /3(f)) is obtained by solving (14) about /3(f), for which only 3 multiplications 
and 2 additions are required. 

It is quite straightforward to generalize (14) but before proceeding, look at a program 
DB q = {m} U {g:-m A m, g:-m} where g is an observable atom and m the only msw atom. 
We have g ^ (m A m) V m in our semantics, but to compute P(g) = P(m)P(m) + P(m) is 
clearly wrong as it ignores the fact that clause bodies for g, i.e. mAm and m are not mutually 
exclusive, and atoms in the clause body mAm are not independent (here P(-) = Pdb (•)). 
Similarly, if we set a = b = c = d = m, the equation (14) will be totally incorrect. 

24. Note that if A is a fact in F, /3(A) = P ms *(A | 6). 
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We therefore add, temporarily in this subsection, two assumptions on top of the ex- 
clusiveness condition and the finite support condition so that equations like (14) become 
mathematically correct. The first assumption is that "clause" bodies are mutually exclu- 
sive i.e. if there are two clauses B <— W and B <— W , Pdb(W A W \ 0) = 0, and the 
second assumption is that body atoms are independent, i.e. if A <— B\ A • • • A Bk is a rule, 
Pdb(B 1 A • • • A B k | 0) = P DB (B 1 \ 9) • • • Pdb (B k \ 9) holds. 

Please note that "clause" used in this subsection has a special meaning. It is intended to 
mean G <— t where G is a goal and r is a tabled explanation for G obtained by OLDT search 
both of which will be explained in the next subsection. 25 In other words, these additional 
conditions are not imposed on a source program but on the result of OLDT search. So 
clauses for auxiliary computations do not need to satisfy them. 

Now suppose clauses about A occur in DB like 

A <- A • • • A B lM 

A <- B Ljl A • • • A B LjiL 

where Bhj (1 < h < L, 1 < j ' < ih) is an atom. Theorem A.l in Appendix A and the above 
assumptions ensure 

f 3(A) = f[f3(B 1 ,) + ...+ f[f3(B L ,). (15) 
j=i j=i 

(15) suggests that fi{G t ) can be considered as a function of fi(A) if these equations about 
inside probabilities are hierarchically organized in such a way that fi(Gt) belongs to the top 
layer and any fi(A) appearing on the left hand side only refers to /3(_B)'s which belong to the 
lower layers. We refer to this condition as the acyclic support condition. Under the acyclic 
support condition, equations of the form (15) have a unique solution, and the computation 
of Pdb(G I 9) via inside probabilities allows us to take advantage of reusing intermediate 
results stored as (3(A), thereby contributing to faster computation of PoB(Gt \ 0). 

Next we tackle a more intricate problem, the computation of ^5e^ DB (G t ) Pmsv(S \ 
0)(J hV (S). Since the sum equals J2 n J2msv<ii,n,v)eSeip DB (G t ) P ™™(S I #), we concentrate 
on the computation of 

fi(G t ,m)= J2 P^(S\0) 

mese4>DB(G t ) 

where m = msw(i ,n ,v ) . First we note that if an explanation S contains m like S = a\ A • • • A 
A m, then we have fi(S) = fi(a\) ■ ■ ■ /3(a/ l )/3(m). So fj,(Gt,m) is expressed as 

ti(G t ,m) = a(G t ,m)(3(m) (16) 

where a(G t ,m) = 9 g^'^ and a(G t ,m) does not depend on /3(m). Generalizing this obser- 
vation to arbitrary ground atoms, we introduce the outside probability of ground atom A 
w.r.t. G t by 

, r def 3^) 

25. The logical relationship (13) corresponds to (20) where f , g and h are table atoms. 
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assuming the same conditions as inside probability. In view of (16), the problem of comput- 
ing /i(Gt,m) is now reduced to that of computing a(G t ,m), which is recursively computable 
as follows. Suppose A occurs in the ground program DB like 

B r <— A f\ Wi } i, ■ ■ ■ ,Bi ^AAW Ul 

B K ±- A A Wk,i , • • • , B K ±- A A W K j K . 

As fi{G t ) is a function of (3(Bi), . . . , (3(Bk) by our assumption, the chain rule of derivatives 
leads to 

n(r A , _ ( dfi(G t ) \ f d{3(AAW hl ) \ f dfi(G t ) \ f d{3(AAW K , K ) 

a[UuA) U/3(i?i);V dfi(A) ) + "' + \dfi(B K ))\ dfi(A) 

and hence to 26 

a(G t ,G t ) = 1 (17) 

a(G u A) = a(G t ,B 1 )f2P(Wij) + --- + a(G t ,B K )f^P(W KJ ). (18) 

j=i j=i 

Therefore if all inside probabilities have already been computed, outside probabilities are 
recursively computed from the top (17) using (18) downward along the program layers. In 
the case of DB p with f and m being chosen atoms, we compute 

a(f,f) = 1 

a(f,g) = /3(a) + /3(b) 

a(f,h) = «(f,g)/3(d) 

a(f,m) = a(f,h). 

From (19), the desired sum /i(f ,m) is calculated as 

/x(f ,m) = a(f ,m)/3(m) = (/3(a) + /3(b))/3(d)/3(m) 

which requires only two multiplications and one addition compared to four multiplications 
and one addition in the naive computation. 

Gains obtained by computing inside and outside probability may be small for this case, 
but as the problem size grows, they become enormous, and compensate enough for addi- 
tional restrictions imposed on the result of OLDT search. 

4.7 OLDT Search 

To compute inside and outside probability recursively like (15) or (17) and (18), we need 
at programming level a tabulation mechanism for structure-sharing of partial explanations 



26. Because of the independence assumption on body atoms, Wh,j (1 < h < K, 1 < j ' < in) and A are 
independent. Therefore 

dfl(A A Wh,j) _ df}(A)f}(W hlj ) 

df)(A) ~ W{A) " mh ' j) - 
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between subgoals. We henceforth deal with programs DB in which a set table(DB) of table 
predicates are declared in advance. A ground atom containing a table predicate is called 
a table atom. The purpose of table atoms is to store their support sets and eliminate the 
need of recomputation, and by doing so, to construct hierarchically organized explanations 
made up of the table atoms and the msw atoms. 

Let DB = F msw U R be a parameterized logic program which satisfies the finite support 
condition and the uniqueness condition. Also let G\, G2, ■ ■ ■ , Gy be a random sample of 
observable atoms in obs(DB). We make the following additional assumptions. 

Assumptions: 

For each / (1 < / < T), there exists a finite set {t{, . . . , rj^} of table atoms associated 
with conjunctions S\ ■ (0 < k < K t , 1 < j < ra^) such that 

comp(R) h (G^^V.^VS^) ^ ^ 

A fa ~ ^,1 V • • • V S l mi ) A • • • A (rj <t ~ S* Kttl V ••• V 5},- jmA , ) 

where 

• each S\ ■ (0 < A; < ii-'t,l < j < m^.) is, as a set, a subset of _F msH U {r^+i, . . . ,rx t } 
(acyclic support condition). As a convention, we put tq = Gt and call respectively 

t 1 db A = {to, t\, . . . , Tj^} the se^ 0/ table atoms for G t and S l k ■ (k > 0) a t-explanation 
for t^. 27 The set of all t-explanations for is denoted by t/'-DfilYfc) and we consider 
V'-Db(') as a function of table atoms. 

• t-explanations are mutually exclusive, i.e. for each k (0 < k < Kt), Pdb{S\ j A S\ ■,) = 
(1 ^ J / j' ^ m k) (t~ exclusiveness condition). 

• S\ ■ (0 < k < K t ,l < j < rrik) is a conjunction of independent atoms (independent 
condition). 28 

These assumptions are aimed at efficient probability computation. Namely, the acyclic 
support condition makes dynamic programming possible, the t-exclusiveness condition re- 
duces Pe>b(AV B) to Pdb(A) + Pdb{B) and the independent condition reduces Pdb(A A B) 
t° Pdb(A)Pdb(B) ■ There is one more point concerning efficiency however. Note that the 
computation in dynamic programming proceeds following the partial order on rjj B 29 im- 
posed by the acyclic support condition and access to the table atoms will be much simplified 
if they are linearly ordered. We therefore topologically sort rjj B respecting the said partial 
order and call the linearized t 1 db satisfying the three assumptions (the acyclic support con- 
dition, the t-exclusiveness condition and the independent condition) a hierarchical system 
of t-explanations for Gt. We write it as t 1 db = [tq,t{, . . . , rj <t ) (ro = Gt) assuming iPdb(') is 
implicitly given. 30 Once a hierarchical system of t-explanations for G t is successfully built 

27. Prefix "t-" is an abbreviation of "tabfed-". 

28. The independence mentioned here oniy concerns positive propositions. For B\, B2 G head(DB), we say 
5i and B 2 are independent if P DB (5i A B 2 \ 0) = P DB (5i | 0)P D b(B 2 \ 0) for any 0. 

29. r 8 precedes t 3 if and only if the top-down execution of r t w.r.t. DB invokes tj directly or indirectly. 

30. So now it holds that if r t precedes tj then i < j. 
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from the source program, equations on inside probability and outside probability such as 
(14) and (19) are automatically derived and solved in time proportional to the size of the 
equations. It plays a central role in our approach to efficient EM learning. 

One way to obtain such t-explanations is to use OLDT search (Tamaki & Sato, 1986; 
Warren, 1992), a complete refutation method for logic programs. In OLDT search, when 
a goal G is called for the first time, we set up an entry for G in a solution table and store 
its answer substitutions G9 there. When a call to an instance G' of G occurs later, we stop 
solving G' and instead try to retrieve an answer substitution G9 stored in the solution table 
by unifying G' with G9. To record the remaining answer substitutions of G, we prepare a 
lookup table for G' and hold a pointer to them. 

For self-containedness, we look at details of OLDT search using a sample program 
DB h = F h UR h in Figure 4 31 which depicts an HMM 32 in Figure 3. This HMM has two states 
{sO, si}. At a state transition, it probabilistically chooses the next destination from {sO, si} 




Figure 3: Two state HMM 

{fl: values (init , [sO,sl]). 
f2: values (out (_) , [a,b] ) . 
f3: values(tr(_) , [sO.sl]). 



Rh 



hi: 



h2: 



hmm(Cs):- 7 To generate a string (chars) Cs, 

msw (init, once, Si) , 7 Set initial state to Si, and then 

hmm(l,Si,Cs) . 7 Enter the loop with clock = 1. 

hmm(T,S, [C|Cs]) :- T=<3, 7 Loop: 



h3: 



msw(out(S) ,T,C) , 
msw(tr(S) ,T,IextS) 
Tl is T+l, 
hmm(Tl,IextS,Cs) . 
hmm(T,_, []) :- T>3. 



7 Output C in state S. 

7 Transit from S to NextS. 

7 Put the clock ahead. 

7 Continue the loop (recursion) 

7 Finish the loop if clock > 3. 



Figure 4: Two state HMM program DB^ 



31. fl, f2, f3, hi, h2 and h3 are temporary marks, not part of the program. 

32. An HMM defines a probability distribution over strings in the given set of alphabets, and works as a 
stochastic string generator (Rabiner & luang, 1993) such that an output string is a sample from the 
defined distribution. 
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and also an alphabet from {a,b} to emit. Note that to specify a fact set and the associ- 
ated distribution compactly, we introduce here a new notation values (i, [fi , . . . ,f m ] ) . It 
declares that contains msw atoms of the form msw(i ,n,v ) (v £ {v\, . . . , v m }) whose distri- 
bution is P(i^ n ) given by (7) in Subsection 4.2. For example, (f 3), values (tr(_) , [sO,sl] ) 
introduces msw(tr(i) ,n,v) atoms into the program such that t can be any ground term, 
v £ {sO,sl} and for a ground term ra, they have a distribution 

^(tr(«, n )( msw ( tr ^) ,n,sO) = a;,msw(tr(i) ,ra,sl) = y \ 9 tyS o,0 t ,.si) = ^>(A>i 

where i = tr (t) , x, y £ {0, 1} and x + y = 1. 

This program runs like a Prolog program. For a non-ground top-goal <— hmm(S), it func- 
tions as a stochastic string generator returning a list of alphabets such as [a,b,a] in the 
variable S as follows. The top-goal calls clause (hi) and (hi) selects the initial state by ex- 
ecuting subgoal msw(init,once,Si) 33 which returns in Si an initial state probabilistically 
chosen from {s0, si}. The second clause (h2) is called from (hi) with ground S and ground 
T. It makes a probabilistic choice of an output alphabet C by asking msw (out (S) ,T,C) and 
then determines NextS, the next state, by asking msw(tr(S) ,T,NextS). (h3) is there to 
stop the transition. For simplicity, the length of output strings is fixed to three. This way 
of execution is termed as sampling execution because it corresponds to a random sampling 
from PDB h - If the top-goal is ground like <— hmm( [a,b , a] ) , it works as an acceptor, i.e. 
returning success (yes) or failure (no). 

If all explanations for hmm( [a,b , a] ) are sought for, we keep all msw atoms resolved upon 
during the refutation as a conjunction (explanation), and repeat this process by backtrack- 
ing until no more refutation is found. If we need t-explanations however, backtracking must 
be abandoned because sharing of partial explanations through t-explanations, the purpose 
of t-explanations itself, becomes impossible. We therefore instead use OLDT search for all 



top_hmm(Cs ,Ans) : - tab_hmm(Cs , Ans , [] ) . 
tab_hmm(Cs, [hmm(Cs) |X] ,X) :- hmm(Cs,_, [] ) . 
tab_hmm(T,S,Cs, [hmm(T,S,Cs) |X] ,X) :- hmm(T,S,Cs,_, [] ) 
e_msw(init,T,sO, [msw(init,T,sO) |X] ,X) . 



t7: hmm(Cs,X0,Xl) :- e_msw(init,once,Si,X0,X2) , tab_hmm(l,Si,Cs,X2,Xl) 
t8: hmm(T,S, [C|Cs] ,X0,X1) :- 

T=<3, e_msw(out(S) ,T,C,X0,X2) , e_msw(tr(S) ,T,IextS,X2,X3) , 

Tl is T+l, tab_hmm(Tl,IextS,Cs,X3,Xl) . 
t9: hmm(T,S, [] ,X,X) :- T>3. 



Figure 5: Translated program of DB^ 

33. If msw(i,ra,V) is called with ground i and ground n, V, a logical variable, behaves like a random variable. 
It is instantiated to some term v with probability 9 1<V selected from the value set V t declared by a values 
atom. If, on the other hand, V is a ground term v when called, the procedural semantics of msw(i , n ,v) 
is equal to that of msw(i , n , V) A V = v. 
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t-explanation search. In the case of the HMM program for example, to build a hierarchical 
system of t-explanations for hmm( [a,b , a] ) by OLDT search, we first declare hmm/1 and 
hmm/3 as table predicate. 34 So a t-explanation will be a conjunction of hmm/1 atoms, hmm/3 
atoms and msw atoms. We then translate the program into another logic program, analo- 
gously to the translation of definite clause grammars (DCGs) in Prolog (Sterling & Shapiro, 
1986). We add two arguments (which forms a D-list) to each predicate for the purpose of 
accumulating msw atoms and table atoms as conjuncts in a t-explanation. The translation 
applied to DB^ yields the program in Figure 5. 

In the translated program, clause (tl) corresponds to the top-goal <— hmm(/) with an 
input string /, and a t-explanation for the table atom hmm(/) will be returned in Ans. (t2) 
and (t3) are auxiliary clauses to add to the callee's D-list a table atom of the form hmm(/) 
and hmm(i,s,/) respectively (/: time step, s: state). In general, if p/ra is a table predicate 
in the original program, p/(n+ 2) becomes a table predicate in the translated program and 
an auxiliary predicate tab_p/(n + 2) is inserted to signal the OLDT interpreter to check the 
solution table for p/ra, i.e. to check if there already exist t-explanations for p/ra. Likewise 
clauses (t4) and (t4') are a pair corresponding to (f l) which insert msw(init,T,-) to the 
callee's D-list with T = once. Clauses (t7), (t8) and (t9) respectively correspond to (hi), 
(h2) and (h3). 

hmm ( [a, b, a] ) : [hmm ( [a, b, a] ) ] 

' [ [msw(init, once, sO) , hmm (1 , sO , [a, b, a] ) ] , 

[msw(init, once, si) , hmm (1 , si , [a, b, a] ) ] ] 

hmm(l, sO, [a, b, a] ) : [hmm (1, sO, [a, b, a] ) ] 

1 *- [ [msw(out (sO) , 1, a) , msw (tr (sO) , 1, qO) , hmm(2, sO, [b, a] ) ] , 

[msw(out (sO) , 1, a) , msw (tr (sO) , 1, si) , hmm(2, si, [b, a] ) ] ] 

hmm(l, si, [a, b, a] ) : [hmm (1, si, [a, b, a] ) ] 

1 » [ [msw(out (si) , 1, a) , msw (tr (si) , 1, sO) , hmm(2, sO, [b, a] ) ] , 

[msw(out (si) , 1, a) , msw (tr (si) , 1, si) , hmm(2, si, [b, a] ) ] ] 

hmm (2, sO, [b, a] ) : [hmm (2, sO, [b, a] ) ] 

1 » [ [msw(out (sO) , 2,b) , msw (tr (sO) , 2 , sO) , hmm(3, sO, [a] ) ] , 

[msw(out (sO) , 2,b) , msw (tr (sO) , 2 , si) , hmm(3, si, [a] ) ] ] 

hmm (2, si, [b, a] ) : [hmm (2, si, [b,a] ) ] 

1 » [ [msw(out (si) , 2,b) , msw (tr (si) , 2 , sO) , hmm(3, sO, [a] ) ] , 

[msw(out (si) ,2,b) , msw (tr (si) , 2 , si) , hmm(3, si, [a] ) ] ] 

hmm (3, sO, [a] ) : [hmm (3, sO, [a] ) ] 

' [ [msw (out (sO) , 3, a) , msw(tr (sO) , 3, sO) , hmm(4, sO, [] ) ] , 

[msw(out(sO) ,3, a) , msw(tr (sO) , 3, si) , hmm(4, si, [] ) ] ] 

hmm (3, si, [a] ) : [hmm (3, si, [a])] 

1 » [ [msw (out (si) , 3, a) , msw(tr (si) , 3, sO) , hmm(4, sO, [] ) ] , 

[msw (out (si) , 3, a) , msw(tr (si) , 3, si) , hmm (4, si, [] ) ] ] 

hmm(4,s0, []) : [hmm(4,s0, [])] 



hmm (4, si, []) : [hmm (4, si, [] ) ] 

' * [[]] 



Figure 6: Solution table 



34. In general, p/ra means a predicate p with arity ra. So although hmm/1 and hmm/3 share the predicate name 
hmm, they are different predicates. 
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Then after translation, we apply OLDT search to ^top_hmm( [a,b,a] ,Ans) while not- 
ing (i) the added D-list does not influence the OLDT procedure, and (ii) we associate with 
each solution of a table atom in the solution table a list of t-explanations. The resulting 
solution table is shown in Figure 6. The first row reads that a call to hmm( [a,b,a] ) oc- 
curred and entered the solution table and its solution, hiran( [a,b , a] ) (no variable bind- 
ing generated), has two t-explanations, msw(init,once,sO) A hmm(l,sO, [a,b,a] ) and 
msw(init,once,sl) A hmm(l,sl, [a,b,a] ). The remaining task is the topological sort- 
ing of the table atoms stored in the solution table respecting the acyclic support condition. 
This can be done by using depth-first search (trace) of t-explanations from the top-goal for 
example. Thus we obtain a hierarchical system of t-explanations for hmm( [a,b,a] ). 

4.8 Support Graphs 

Looking back, all we need to compute inside and outside probability is a hierarchical system 
of t-explanations, which essentially is a boolean combination of primitive events (msw atoms) 
and compound events (table atoms) and as such can be more intuitively representable as a 
graph. For this reason, and to help visualizing our learning algorithm, we introduce a new 
data-structure termed support graphs, though the new EM algorithm in the next subsection 
itself is described solely by the hierarchical system of t-explanations. 

As illustrated in Figure 7 (a), the support graph for G t is a graphical representation of 
the hierarchical system of t-explanations T f DB = (tq, t\, . . . , rj <t ) (tq = G t ) for G t in (20). 
It consists of totally ordered disconnected subgraphs, each of which is labeled with the 
corresponding table atom rj, in T f DB (0 < k < K t ). A subgraph labeled rj, comprises two 
special nodes (the start node and the end node) and explanation graphs, each corresponding 
to a t-explanation S\ ■ in ipDB( T k) (1 < J < m k)- 

An explanation graph of S\ ■ is a linear graph in which a node is labeled either with a 
table atom r or with a switch msw(-,-,-) in S\ ■. They are called a table node and a switch 
node respectively. Figure 7 (b) is the support graph for hiran( [a,b , a] ) obtained from the 
solution table in Figure 6. Each table node labeled r refers to the subgraph labeled r, so 
data-sharing is achieved through the distinct table nodes referring to the same subgraph. 

4.9 Graphical EM Algorithm 

We describe here an efficient EM learning algorithm termed the graphical EM algorithm 
(Figure 8) introduced by Kameya and Sato (2000), that runs on support graphs. Suppose 
we have a random sample Q = G\, . . . ,Gj of observable atoms. Also suppose support 
graphs for G t (1 < t < T), i.e. hierarchical systems of t-explanations satisfying the acyclic 
support condition, the t-exclusiveness condition and the independent condition, have been 
successfully constructed from a parameterized logic program DB satisfying the uniqueness 
condition and the finite support condition. 

The graphical EM algorithm refines learn-naive(DB ,Q ) by introducing two subroutines, 
get-mside-probs(DB , Q) to compute inside probabilities and get-expectations(DB, Q) to com- 
pute outside probabilities. They are called from the main routine learn- gEM(DB,G). When 
learning, we prepare four arrays for each support graph for G t in Q: 

• ^t^ 7 "] f° r the inside probability of r, i.e. /3(r) = Pdb( t \ 0) ( see (12)) 
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(a) 



Q) "v^w) KTj) ex Pl anat i° n graph 




(b) 



msw ( init , once, sO ) hmm(l,sO, [a,b,a]) 

hmm( [a,b,a] ) : 

o 

msw (init, once, si) hmm(l,sl, [a,b,a]) 




msw (out (sO) , 1, a) msw (tr (sO) , 1 , sO) hmm (2 , sO , [b, a] ) 

hmm(l, sO, [a,b,a] ) : 




o— o 




msw(out (sO) , 1, a) msw (tr (sO) , 1 , si) hmm (2 , si , [b, a] ) 



msw(out (si) , 1, a) msw (tr (si) , 1 , sO) hmm (2 , sO , [b, a] ) 

hmm(l, si, [a,b,a] ) : 





msw (out (si) , 1, a) msw (tr (si) , 1 , si) hmm (2 , si , [b, a] ) 



gure 7: A support graph (a) in general form, (b) for G t = hmm( [a,b,a] ) in the HMM 
program DB^. A double- circled node refers to a table node. 



• Q[^ r ] f° r the outside probability of r w.r.t. G t , i.e. a(G t ,T) (see (17) and (18)) 

• lZ[t,T,S] for the explanation probability of S (E iPdb( t I))^ i- e - Pdb(S I $) 
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1 


procedure learn-gEM(DB ,Q) 


1 


procedure get-inside-probs(DB , Q) 


2 


begin 


2 


begin 


3 




Select some as initial 


3 


for t := 1 to T do begin 






parameters; 


4 


Let r * = G t ; 


4 




get-inside-probs(DB , (/); 


5 


for k := .Kj downto do begin 


5 




A (0) := Et=iln7>[^t]; 


6 


V[t,rl] := 0;^ 


6 




repeat 


7 


foreach S £ ipDB( T k) do begin 


7 




get-expectations(L>B ,y j; 


8 


Let 5 = {Ai,A 2 , A-,}; 


8 




foreach i G /, u 6 do 




9 




7y[i,-y] : = 


9 


ft[i,r£,S] := lj 






Ef=i J 7 [i,i,t;]/7'[i,G t ]; 


10 


for / := 1 to 15*1 do 


10 




foreach i £ /, v £ do 


11 


if A\ = msw(i, • ,v ) then 


11 




e hV := ^[^/E^'ev; 
get-inside-probs(DB 


12 


n[t,Tl,s]*= t)V 


12 




13 


else U[t,Tl,S\*= V[t,A,]; 


13 




m := to + 1; 


14 


V[t,rl]+= K[tA,S] 


14 




A (m) := £f =1 ln7>[i,G t ] 


15 


end /* foreach S */ 


1 K 
10 




until A( m ) - A^" 1 ) < e 


16 


end /* for A; */ 


16 


end. 


17 


end /* for t */ 








18 


end. 




1 


procedure get-expectations(DB , (/) begin 




2 


for t := 1 to T do begin 








3 


foreach i £ /, v £ V; do r][t, i, v] := 0; 




4 


Let r* = G t ; Q[t,T*] := 1.0; 








5 


for A; := 1 to A' t do Q[i,r^] := 


0; 






6 


for k := to ii^ do 








7 


foreach 5* £ ipDB{ T k) ^° begin 






8 


Let S = {Ai,A 2 , . . . , -4|5|}; 








9 


for / := 1 to 15*1 do 








10 


if A; = msw(i,-,-u) then rj[t,i, v] += Q[i,r^] • 7^,r^, S] 




11 


else Q[i,A,]+= r*] • 


n[t,T 






12 


end /* foreach S */ 








13 


end /* for / */ 








14 


end. 







Figure 8: graphical EM algorithm. 



• r)[t,i,v] for the expected count of msw(i,-,w), i.e. Esefe^Gt) -f > msw(5' | 0)vi,v{S) 

and call the procedure learn- gEM(DB,G ) in Figure 8. The main routine learn- gEM(DB,G ) ini- 
tially computes all inside probabilities (Line 4) and enters a loop in which get-expectations(DB,G ) 
is called first to compute the expected count r][t,i,v] of msw(i,-,f) and parameters are up- 
dated (Line 11). Inside probabilities are renewed by using the updated parameters before 
entering the next loop (Line 12). 
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The subroutine get-inside-probs(DB ,G ) computes the inside probability /3(r) = Pdb( t \ 0) 
(and stores it in V[t, r]) of a table atom r from the bottom layer up to the topmost layer To = 
Gf (Line 4) of the hierarchical system of t-explanations for G t (see (20) in Subsection 4.6). 
It takes a t-explanation S in iPdb( t D one by one (Line 7), decomposes S into conjuncts and 
multiplies their inside probabilities which are either known (Line 12) or already computed 
(Line 13). 

The other subroutine get-expectations(DB,G ) computes outside probabilities following the 
recursive definitions (17) and (18) in Subsection 4.6 and stores the outside probability 
a(Gt,T) of a table atom r in Q[/, r]. It first sets the outside probability of the top-goal 
t = G t to 1.0 (Line 4) and computes the rest of outside probabilities (Line 6) going down 
the layers of the t-explanation for G t described by (20) in Subsection 4.6. (Line 10) adds 
Q[t,Tj,] • TZ[t,rl, S] = a(Gf,Tl) • fl(S) to r][t,i,v], the expected count of msw(i,-,f), as a 
contribution of msw(i,-,f ) in S through t\ to r][t,i,v]. (Line 11) increments the outside 
probability = a(G t ,Ai) of A\ according to the equation (18). Notice that Q[t,Tj,] 

has already been computed and lZ[t, rf., S]/V[t, A{\ = fi{W) for S = A\ A W. As shown in 
Subsection 4.5, learn-naive(DB ,Q ) is the MLE procedure, hence the following theorem holds. 

Theorem 4.1 Let DB be a parameterized logic program, and Q = G\, . . . ,Gt a ran- 
dom sample of observable atoms. Suppose the five conditions (uniqueness, finite support 
(Subsection 4-2), acyclic support, t-exclusiveness and independence (Subsection 4-7)) are 
met. Then learn-gEM (DB,G) finds the MLE 0* which (locally) maximizes the likelihood 

L(g\o) = YiI =1 PDB(G t \o). 

(Proof) Sketch. 35 Since the main routine learn-gEM(DB,G) is the same as learn-naive(DB,Q) 
except the computation of r][i, v] = J2t=i ^[^ h v ]> we show that r][t,i,v] = J2seipDB{Gt) Pms*(S 
0)<7 t)V (S) (= Y.nY.msvti,n,v)eSeiPDB(G t ) p ms«( s I d ))- However, 

V [t,i,v] = E E H_ a(G u ri)(3(S) 

0<k<K t n msw ( l;ra; „) e 5 e ^, DB ( T t) 

(see (Line 10) in get-expectations(DB 
= y^ct((jf,msw(i,ra,f ))/3(msw(i,ra,f )) 

n 

= E A*(<Jt,msw(i ,n ,v)) (see the equation (16)) 

n 

= E E P^(S\6). Q.E.D. 

Here we used the fact that if S contains msw(i,ra ,v ) like S = S' A msw(i,ra ,v ) , fi{S) = 
/3(5')/3(msw(i,ra,f )) holds, and hence 

a(G t y k )[i(S) = a(G t y k )[i(S')[i(m^{i jn ,v)) 

= (contribution of msw(i,ra,f) in 5 through t\ to a(Gt,msw(i ,n ,v )))/3(msw(i,ra,f )). 

35. A formal proof is given by Kameya (2000). It is proved there that under the common parameters 9, r][i, v] 
in learn-naive(DB ,Q ) coincides with v] in learn-gEM (DB ,Q ). So, the parameters are updated to 
the same values. Hence, starting with the same initial values, the parameters converge to the same 
values. 
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The five conditions on the applicability of the graphical EM algorithm may look hard 
to satisfy at once. Fortunately, the modeling principle in Section 4.3 still stands, and with 
due care in modeling, it is likely to lead us to a program that meets all of them. Actually, 
we will see in the next section, programs for standard symbolic-statistical frameworks such 
as Bayesian networks, HMMs and PCFGs all satisfy the five conditions. 



5. Complexity 

In this section, we analyze the time complexity of the graphical EM algorithm applied 
to various symbolic-statistical frameworks including HMMs, PCFGs, pseudo PCSGs and 
Bayesian networks. The results show that the graphical EM algorithm is competitive with 
these specialized EM algorithms developed independently in each research field. 



5.1 Basic Property 

Since the EM algorithm is an iterative algorithm and since we are unable to predict when 
it converges, we measure time complexity by the time taken for one iteration. We therefore 
estimate time per iteration on the repeat loop of learn-gEM(DB , Q) (Q = G\, . . . , Gy). We 
observe that in one iteration, each support graph for G t (1 < t < T) is scanned twice, once 
by get-inside-probs(DB ,Q) and once by get-expectations(DB ,Q). In the scan, addition is 
performed on the t-explanations, and multiplication (possibly with division) is performed 
on the msw atoms and table atoms once for each. So time spent for Gt per iteration by the 
graphical EM algorithm is linear in the size of the support graph, i.e. the number of nodes 
in the support graph for Gt. Put 

ALb = U $db(t) 



T ^ T< DB 



def 



max I A no I 

KKT 



^ maxsize 



def I ~ I 

= m <i x _ Pi 

l<t<X,SeA^ B 



Recall that T f DB is the set of table atoms for G t , and hence A f DB is the set of all t-explanations 
appearing in the right hand side of (20) in Subsection 4.7. So ^ num is the maximum number 
of t-explanations in a support graph for the Gj's and ^ maxs i ze the maximum size of a t- 
explanation for the Gf's respectively. The following is obvious. 

Proposition 5.1 The time complexity of the graphical EM algorithm per iteration is linear 
in the total size of support graphs, 0(^ num ^maxsize? 1 ) in notation, which coincides with the 
space complexity because the graphical EM algorithm runs on support graphs. 



This is a rather general result, but when we compare the graphical EM algorithm with 
other EM algorithms, we must remember that the input to the graphical EM algorithm is 
support graphs (one for each observed atom) and our actual total learning time is 

OLDT time + (the number of iterations) xO(^ nmn ^ maxs i ze r) 
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where "OLDT time" denotes time to construct all support graphs for Q. It is the sum of 
time for OLDT search and time for the topological sorting of the table atoms, but because 
the latter is part of the former order-wise, 36 we represent "OLDT time" by time for OLDT 
search. Also observe that the total size of support graphs does not exceed time for OLDT 
search for Q order- wise. 

To evaluate OLDT time for a specific class of models such as HMMs, we need to know 
time for table operations. Observe that our OLDT search in this paper is special in the 
sense that table atoms are always ground when called and there is no resolution with solved 
goals. Accordingly a solution table is used only 

• to check if a goal G already has an entry in the solution table, i.e. if it was called 
before, and 

• to add a new searched t-explanation for G to the list of discovered t-explanations 
under G"s entry. 

The time complexity of these operations is equal to that of table access which depends 
both on the program and on the implementation of the solution table. 37 We first suppose 
programs are carefully written in such a way that the arguments of table atoms used as in- 
decies for table access are integers. Actually all programs used in the subsequent complexity 
analysis (DB^ in Subsection 4.7, DB g and DB g t in Subsection 5.3, DBqt in Subsection 5.5) 
satisfy or can satisfy this condition by replacing non-integer terms with appropriate inte- 
gers. We also suppose that the solution table is implemented using an array so that table 
access can be done in 0(1) time. 38 

In what follows, we present a detailed analysis of the time complexity of the graphical 
EM algorithm applied to HMMs, PCFGs, pseudo PCSGs and Bayesian networks, assuming 
0(1) time access to the solution table. We remark by the way that their space complexity 
is just the total size of solution tables (support graphs). 

5.2 HMMs 

The standard EM algorithm for HMMs is the Baum- Welch algorithm (Rabiner, 1989; Ra- 
biner & Juang, 1993). An example of HMM is shown in Figure 3 in Subsection 4. 7. 39 Given 
T observations w 1 , . . . , w T of output string of length i, it computes in 0(N 2 LT) time in 
each iteration the forward probability a^q) = P(o\o\ ■ ■ ■ o t m _ 1 , q \ 9) and the backward 
probability fi^iq) = P(o t m o t m+1 ■ ■ ■ o l L \ q, 9) for each state q £ Q, time step m (1 < m < L) 
and a string w t = o\o\ ■ ■ ■ (1 < t < T), where Q is the set of states and N the number of 
states. The factor N 2 comes from the fact that every state has N possible destinations and 

36. Think of OLDT search for a top-goal Gt. It searches for msw atoms and table atoms to create a so- 
lution table, while doing some auxiliary computations. Therefore its time complexity is never less 
than 0(|the number of msw atoms and table atoms in the support graph for Gt\), which coincides with 
the time we need to topologically sort table atoms in the solution table by depth-first search from To = Gt. 

37. Sagonas et al. (1994) and Ramakrishnan et al. (1995) discuss about the implementation of OLDT. 

38. If arrays are not available, we may be able to use balanced trees, giving O(logra) access time where n 
is the number data in the solution table, or we may be able to use hashing, giving average O(l) time 
access under a certain condition (Cormen, Leiserson, & Rivest, 1990). 

39. We treat here only "state-emission HMMs" which emit a symbol depending on the state. Another type, 
"arc-emission HMMs" in which the emitted symbol depends on the transition arc, is treated similarly. 
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we have to compute the forward and backward probability for every destination and every 
state. After computing all a^(g)'s and /3^(g)'s, parameters are updated. So, the total 
computation time in each iteration of the Baum- Welch algorithm is estimated as 0(N 2 LT) 
(Rabiner & Juang, 1993; Manning & Schiitze, 1999). 

To compare this result with the graphical EM algorithm, we use the HMM program 
DBh in Figure 4 with appropriate modifications to L, the length of a string, Q, the 
state set, and declarations in Fh for the output alphabets. For a string w = o\Oi •••ol, 
hmm(n ,q, [o m , o m+ i, . . . , o/] ) in DBh reads that the HMM is in state q £ Q at time n and 
has to output [o m ,o m+ i , . . . ,o/] until it reaches the final state. After declaring hmm/1 and 
hmm/3 as table predicate and translation (see Figure 5), we apply OLDT search to the goal 
<— top_hmm( [oi , . . . ,o/] ,Ans) w.r.t. the translated program to obtain all t-explanations 
for hmm([oi, . . . ,o/]) . For a complexity argument however, the translated program and 
DBh are the same, so we talk in terms of DBh f° r the sake of simplicity. In the search, 
we fix the search strategy to multi-stage depth- first strategy (Tamaki & Sato, 1986). We 
assume that the solution table is accessible in 0(1) time. 40 Since the length of the list in 
the third argument of hmm/3 decreases by one on each recursion, and there are only finitely 
many choices of the state transition and the output alphabet, the search terminates, leaving 
finitely many t-explanations in the solution table like Figure 6 that satisfy the acyclic sup- 
port condition respectively. Also the sampling execution of ^hmm(L) w.r.t. DBh lS nothing 
but a sequential decision process such that decisions made by msw atoms are exclusive, 
independent and generate a unique string, which means DBh satisfies the t-exclusiveness 
condition, the independence condition and the uniqueness condition respectively. So, the 
graphical EM algorithm is applicable to the set of hierarchical systems of t-explanations for 
hmm(w/) (1 < / < T) produced by OLDT search for T observations w 1 , . . . , w T of output 
string. Put w l = o\o\ ■ ■ ■ o l L . It follows from 



T t DBh = {hmm(m,g, [o^ , . . . ,o^]) | 1 < m < L + 1, q £ Q} U {hmm([oj, . . . ,o^])} 



that for a top-goal hmm( [_o\ , . . . ,o^] ) , there are at most 0(N L) calling patterns of hmm/3 
and each call causes at most N calls to hmm/3, implying there occur 0(NL ■ N) = 0(N 2 L) 
calls to hmm/3. Since each call is computed once due to the tabling mechanism, we have 
^ nmn = 0(N 2 L). Also ^ maxs i Z e = 3. Applying Proposition 5.1, we reach 

Proposition 5.2 Suppose we have T strings of length L. Also suppose each table operation 
in OLDT search is done in 0(1) time. OLDT time by DBh is 0(N 2 LT) and the graphical 
EM algorithm takes 0(N 2 LT) time per iteration where N is the number of states. 

0(N 2 LT) is the time complexity of the Baum- Welch algorithm. So the graphical EM 
algorithm runs as efficiently as the Baum- Welch algorithm. 41 



40. O(l) is possible because in the translated program DBh in Section 4.7, we can identify a goal pattern of 
hmm(- •) by the first two arguments which are constants (integers). 

41. Besides, the Baum- Welch algorithm and the graphical EM algorithm whose input are support graphs 
generated by DBh update parameters to the same value if initial values are the same. 



tl) DBh {' ] asm{m,q, [o^ , . 

(l<m<L) 




msw ( out (q) ,m,o m ) , msw (tr(g) ,m,q'), 
hmm(m + l,q' , [o^ +1 , . . . ,o^] ) 



q'eQ 
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By the way, the Viterbi algorithm (Rabiner, 1989; Rabiner & Juang, 1993) provides for 
HMMs an efficient way of finding the most likely transition path for a given input /output 
string. A similar algorithm for parameterized logic programs that determines the most 
likely explanation for a given goal can be derived. It runs in time linear in the size of the 
support graph, thereby 0(N 2 L) in the case of HMMs, the same complexity as the Viterbi 
algorithm (Sato & Kameya, 2000). 



5.3 PCFGs 

We now compare the graphical EM algorithm with the Inside- Outside algorithm (Baker, 
1979; Lari & Young, 1990). The Inside- Outside algorithm is a well-known EM algorithm 
for PCFGs (Wetherell, 1980; Manning & Schiitze, 1999). 42 It takes a grammar in Chomsky 
normal form. Given N nonterminals, a production rule in the grammar takes the form 
i — ► j,k (1 < i,j,k < N) (nonterminals are named by numbers from 1 to N and 1 is a 
starting symbol) or the form i — ► w where 1 < i < N and to is a terminal. In each iteration, 
it computes the inside probability and the outside probability of every partial parse tree 
of the given sentence to update parameters for these production rules. Time complexity is 
measured by time per iteration, and is described by N, the number of nonterminals, and 
L, the number of terminals in a sentence. It is 0(N 3 L 3 T) for T observed sentences (Lari 
k Young, 1990). 

To compare the graphical EM algorithm with the Inside-Outside algorithm, we start 
from a propositional program DB g = F g U R g below representing the largest grammar 
containing all possible rules i — ► j,k in N nonterminals where nonterminal 1 is a starting 
symbol, i.e. sentence. 



F„ 



{msw(i, [d,d'] , \_j ,k~]) | 1 < i,j,k < N,d,d' are numbers} 

U {msw(i ,d ,w) | 1 < i < N, d is a number, w is a terminal} 



R„ 



q(i,d ,d 2 ) :- msw(i, ld ,d 2 '] , Lj ,k~] ) , 

q(i»4.^i) » 

q(k,di,d 2 ) . 



1 < i,j,k < N, 

< d < di < d 2 < L 



U | q(i,d,d+l) :- msw(i,d,w d+1 ) . 1 < i < N, < d < L - lj 



Figure 9: PCFG program DB 



DB g is an artificial parsing program whose sole purpose is to measure the size of an 
OLDT tree 43 created by the OLDT interpreter when it parses a sentence w\w 2 ■ ■ ■ wl. So 



42. A PCFG (probabilistic context free grammar) is a backbone CFG with probabilities (parameters) as- 
signed to each production rule. For a nonterminal A having n production rules {A — > ct t \ 1 < i < n}, a 
probability p, is assigned to A — > a, (1 < i < n) where Pi = 1. The probability of a sentence s is 
the sum of probabilities of each (leftmost) derivation of s. The latter is the product of probabilities of 
rules used in the derivation. 

43. To be more precise, an OLDT structure, but in this case, it is a tree because DB g contains only constants 
(Datalog program) and there never occurs the need of creating a new root node. 
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Figure 10: OLDT tree for the query <— q(l ,d ,L) 



the input sentence W\W2 ■ ■ -wl is embedded in the program separately as msw(i ,d ,Wd+\) 
(0 < d < L— 1) in the second clauses of R g (this treatment does not affect the complexity ar- 
gument). q(i ,do ,d\) reads that the i-th nonterminal spans from position do to position d\, 
i.e. the substring w^+i ■■■ w ( i 1 . The first clauses qCi.rfo,^) : ~ msw(-,-,-), q(j,<io>^i)> 
q(k,di ,(^2) are supposed to be textually ordered according to the lexicographic order for 
tuples (i,j,k,do,d,2,di). As a parser, the top-goal is set to <— q(l,0,i). 44 It asks the 
parser to parse the whole sentence W1W2 ■ ■ - wl as the syntactic category "1" (sentence). 

We make an exhaustive search for this query by OLDT search. 45 As before, the multi- 
stage depth-first search strategy and 0(1) time access to the solution table are assumed. 
Then the time complexity of OLDT search is measured by the number of nodes in the 
OLDT tree. Let Tj^ be the OLDT tree for ^q(k,d,L). Figure 10 illustrates rj 1 ^ for d 
(0 < d < L — 3) where msw atoms are omitted. As can be seen, the tree has many similar 
subtrees, so we put them together (see Note in Figure 10). Due to the depth-first strategy, 
rj 1 ^ has a recursive structure and contains as a subtree. Nodes whose leftmost atom 

is not underlined are solution nodes, i.e. they solve their leftmost atoms for the first time in 
the entire refutation process. The underlined atoms are already computed in the subtrees 
to their left. 46 They only check the solution table if there are their entries (= already 



44. L here is not a Prolog variable but a constant denoting the sentence length. 

45. q is a table predicate. 

46. It can be inductively proved that contains every computed q(i , d' , d") (0 < d < L — 3, d+1 < d' < 
d" < L, 1 < i < N, d" - d' < L - d - 2). 
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computed) in 0(1) time. Since all clauses are ground, such execution only generates a 
single child node. 

We enumerate the number of nodes in but not in (1 < k < N). From 

Figure 10, we see = 0(N 3 (L - d) 2 ). 47 Let (2 < k < N) be the number of nodes in 

not contained in Tj^. It is estimated as 0(N 2 (L — d — 2)). Consequently, the number 

of nodes that are newly created in is + X!fc=2 = 0(N 3 (L — d) 2 ). As a result, 
total time for OLDT search is computed as Yld=o = 0(N 3 L 3 ) 48 which is also the size of 
the support graph. 

We now consider a non-proposition al parsing program DB g i = F g i U R g i in Figure 11 
whose ground instances constitute the propositional program DB g . DB g t is a probabilistic 
variant of DCG program (Pereira & Warren, 1980) in which q'/l? and between/3 are 
declared as table predicate. Semantically DB g t specifies a probability distribution over the 
atoms of the form {q' (/) | / is a list of terminals}. 



{msw(s 8 - ,t , [sj ) | 1 < i,j,k < N,t is a number} 

U {msw(s 8 - ,t , w) | 1 < i < N,t is a number, w is a terminal} 

q'(S) :- length(S.D), q'(si,0,D,0,_,S-[]). 

q' (I,D0,D2,C0,C2,L0-L2) :- between(D0 ,D1 ,D2) , 

msw(I,C0, [J,K] ) , 
q' (J,D0,Dl,s(C0) ,C1,L0-L1) , 
q' (K,Dl,D2,s(Cl) ,C2,L1-L2) . 

q' (I,D,s(D) ,C0,s(C0) , [W|X]-X) :- msw(I , CO , W) . 



Figure 11: Probabilistic DCG like program DB g t 

The top-goal to parse a sentence S = [wi, . . . , wl~] is <— q' ( [w\, . . . , wl~] ) . It invokes 
q' (si,0,D,0,_, [wi , . . .,wl~] - [] ) after measuring the length D of an input sentence S by 
calling length/ 2. 49 50 In general, q' (i ,do ,di ,cq ,ci J0-I2) works identically to q(i ,do ,(^2) 
but three arguments, Co, C2 and lo~h, are added. Co supplies a unique trial-id for msws to be 
used in the body, C2 the latest trial-id in the current computation, and /0-/2 a D-list holding 
a substring between do and c?2- Since the added arguments do not affect the shape of the 

47. We here focus on the subtree T' d . j, 1 and j' range from 1 to N , and | {(e, e') | d + 2 < e' < e < L — 1 } | = 
0((L — d) 2 ). Hence, the number of nodes in T' d is 0(N 3 (L — d) 2 ). The number of nodes in T d ^ but 
neither in nor in T' d is negligible, therefore h d 1] = 0(N 3 (L - df ). 

48. The number of nodes in T^}_ t and T^}_ 2 is negligible. 

49. To make the program as simple as possible, we assume that an integer n is represented by a ground term 

(n) 

s n = s (• • s (0) • • •) . We also assume that when DO and D2 are ground, the goal ^between(D0 , Dl , D2) 
returns an integer Dl between them in time proportional to |D1 — DO | . 

50. We omit an obvious program for length(l,s n ) which computes the length s n of a list I in 0(|l|) time. 
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search tree in Figure 10 and the extra computation caused by length/2 is O(L) and the 
one by the insertion of between(D0 ,D1 ,D2) is 0(NL 3 ) respectively, 51 OLDT time remains 
0(N 3 L 3 ), and hence so is the size of the support graph. 

To apply the graphical EM algorithm correctly, we need to confirm the five conditions 
on its applicability. It is rather apparent however that the OLDT refutation of any top- 
goal of the form <— q' ( [_w\ ,. . .,wl] ) w.r.t. DB g t terminates, and leaves a support graph 
satisfying the finite support condition and the acyclic support condition. The t-exclusiveness 
condition and the independent condition also hold because the refutation process faithfully 
simulates the leftmost stochastic derivation of w\ ■ ■ ■ wl in which the choice of a production 
rule made by msw(s 8 ,s c , [sj.Sfc]) is exclusive and independent (trial-ids are different on 
different choices). 

What remains is the uniqueness condition. To confirm it, let us consider another pro- 
gram DBgtt, a modification of DB g t such that the first goal length(S,D) in the body of the 
first clause and the first goal between(D0,Dl ,D2) in the second clause of R g t are moved to 
the last position in their bodies respectively. DB g tt and DB g t are logically equivalent, and 
semantically equivalent as well from the viewpoint of distribution semantics. Then think of 
the sampling execution by the OLDT interpreter of a top-goal <— q' (S) w.r.t. DB g tt where 
S is a variable, using the multi-stage depth-first search strategy. It is easy to see first that 
the execution never fails, and second that when the OLDT refutation terminates, a sentence 
Lw\, . . . , wl] is returned in S, and third that conversely, the set of msw atoms resolved upon 
in the refutation uniquely determines the output sentence [_w\, . . . , wl] - 52 Hence, if the 
sampling execution is guaranteed to always terminate, every sampling from Pf n (= Pf i) 
uniquely generates a sentence, an observable atom, so the uniqueness condition is satisfied 
by DBgtt, and hence by DB g t. 

Then when is the sampling execution guaranteed to always terminate? In other words, 
when does the grammar only generate finite sentences? Giving a general answer seems 
difficult, but it is known that if the parameter values in a PCFG are obtained by learning 
from finite sentences, the stochastic derivation by the PCFG terminates with probability 
one (Chi & Geman, 1998). In summary, assuming appropriate parameter values, we can 
say that the parameterized logic program DB g t for the largest PCFG with N nonterminal 
symbols satisfies all applicability conditions, and the OLDT time for a sentence of length 
L is 0(N 3 L 3 ) 53 and this is also the size of the support graph. From Proposition 5.1, we 
conclude 

Proposition 5.3 Let DB be a parameterized logic program representing a PCFG with N 
nonterminals in the form of DB' g in Figure 11, and Q = G\, Gi, ■ ■ ■ , Gj be the sampled atoms 
representing sentences of length L. We suppose each table operation in OLDT search is done 
in 0(1) time. Then OLDT search for Q and one iteration in learn-gEM are respectively 
done in 0(N 3 L 3 T) time. 

51. between(D0,Dl,D2) is called 0{N{L-df ) times in . So it is called J2d=o 0{N{L-df) = 0(NL 3 ) 
times in T^K 

52. Because the trial-ids used in the refutation record which rule is used at what step in the derivation of 
wi ■ ■ ■ wl- 

53. In DB g i, we represent integers by ground terms made out of and s(-) to keep the program short. If 
we use integers instead of ground terms however, the first three arguments of q '(•,-,•,•,•,• ) are enough 
to check whether the goal is previously called or not, and this check can be done in O(l) time. 
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0(N 3 L 3 T) is also the time complexity of the Inside-Outside algorithm per iteration 
(Lari & Young, 1990), hence our algorithm is as efficient as the Inside-Outside algorithm. 

5.4 Pseudo PCSGs 

PCFGs can be improved by making choices context-sensitive, and one of such attempts is 
pseudo PCSGs (pseudo probabilistic context sensitive grammars) in which a rule is cho- 
sen probabilistically depending on both the nonterminal to be expanded and its parent 
nonterminal (Charniak & Carroll, 1994). 

A pseudo PCSG is easily programmed. We add one extra-argument, N, representing 
the parent node, to the predicate q' (I,D0,D2,C0,C2,L0-L2) in Figure 11 and replace 
msw(I,C0, [J,K] ) withmsw( [N,I] ,C0, [J,K] ). Since the (leftmost) derivation of a sentence 
from a pseudo PCSG is still a sequential decision process described by the modified program, 
the graphical EM algorithm applied to the support graphs generated from the modified 
program and observed sentences correctly performs the ML estimation of parameters in the 
pseudo PCSG. 

A pseudo PCSG is thought to be a PCFG with rules of the form [n,i] — ► 
(1 < n, k < N) where n is the parent nonterminal of i, so the arguments in the previous 
subsection are carried over with minor changes. We therefore have (details omitted) 

Proposition 5.4 Let DB be a parameterized logic program for a pseudo PCSG with N 
nonterminals as shown above, and Q = Gi, G2, . . . , Gy the observed atoms representing 
sampled sentences of length L. Suppose each table operation in OLDT search can be done 
in 0(1) time. Then OLDT search for Q and each iteration in learn-gEM is completed in 
0(N 4 L 3 T) time. 

5.5 Bayesian Networks 

A relationship between cause C and its effect E is often probabilistic such as the one be- 
tween diseases and symptoms, and as such it is mathematically captured as the conditional 
probability P(E = e \ C = c) of effect e given the cause c. What we wish to know however is 
the inverse, i.e. the probability of a candidate cause c given evidence e, i.e. P(C = c \ E = e) 
which is calculated by Bayes' theorem as P(E = e \ C = c)P(C = c)/ J2 C ' P(E = e \ C = 
c')P(C = c'). Bayesian networks are a representational/computational framework that fits 
best this type of probabilistic inference (Pearl, 1988; Castillo et al., 1997). 

A Bayesian network is a graphical representation of a joint distribution P(X\ = x\, . . . , 
Xjv = £jv) °f finitely many random variables X\, . . . ,Xjv- The graph is a dag (directed 
acyclic graph) such as ones in Figure 12, and each node is a random variable. 54 

In the graph, a conditional probability table (CPT) representing P(Xi = X{ \ 11; = u{) 
(1 < i < N) is associated with each node X{ where 11; represents X^s parent nodes and U{ 
their values. When X{ has no parent, i.e. a topmost node in the graph, the table is just a 
marginal distribution P(X{ = x{). The whole joint distribution is defined as the product of 



54. We only deal with discrete cases. 
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( g i) Singly-connected ( G 2 ) Multiply-connected 

Figure 12: Bayesian networks 

these conditional distributions: 

N 

p{x 1 = Xl ,...,x N = x N f 5 = n p(Xi = xi I n 8 = Ui ). (21) 

8 = 1 

Thus the graph G\ in Figure 12 defines 

P Gl (a,b,c,d,e,f) = P Gl (a)P Gl (b)P Gl (c \ a)P Gl (d \ a,b)P Gl (e \ d)P Gl (f \ d) 

where a, 6, c, d, e and / are values of corresponding random variables A, B, C, D, E 
and F, respectively. 56 As mentioned before, one of the basic tasks of Bayesian networks 
is to compute marginal probabilities. For example, the marginal distribution P Gl (c,d) is 
computed either by (22) or (23) below. 

P Gl (c,d) = PG 1 (a)PG 1 (b)P Gl (c\a)P Gl (d\a,b)P Gl (e\d)P(f\d) (22) 

a,b,e,f 

= (E P Gi (b)P Gl (c | a)P Gl (d I a, b)j (^2 (e \ d)P Gl (f \ d)j (23) 

(23) is clearly more efficient than (22). Observe that if the graph were like Gi in 
Figure 12, there would be no way to factorize computations like (23) but to use (22) requiring 
exponentially many operations. The problem is that computing marginal probabilities is 
NP-hard in general, and factorization such as (23) is assured only when the graph is singly 
connected like Gi, i.e. has no loop when viewed as an undirected graph. In such case, the 
computation is possible in 0(|y|) time where V is the set of vertices in the graph (Pearl, 
1988). Otherwise, the graph is called multiply- connected, and might need exponential time 
to compute marginal probabilities. In the sequel, we show the following. 

• For any discrete Bayesian network G defining a distribution P G (x\, . . . , xn), there is a 
parameterized logic program DB G for a predicate bn(-) such that P£>B G (i>n(xi , . . . ,x^) ) 
= P G (x 1 ,...,x N ). 

55. Thanks to the acyclicity of the graph, without losing generality, we may assume that if X t is an ancestor 
node of Xj, then i < j holds. 

56. For notational simplicity, we shall omit random variables when no confusion arises. 
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• For arbitrary factorizations and their order to compute a marginal distribution, there 
exists a tabled program that accomplishes the same computation in the specified way. 

• When the graph is singly connected and evidence e is given, there exists a tabled 
program DBqt such that OLDT time for <— bn(e) is 0(| V"|), and hence the time 
complexity per iteration of the graphical EM algorithm is 0( | V"| ) as well. 

Let G be a Bayesian network defining a joint distribution Pq(x\, . . . , sjv) and {Po(Xi = 
Xi | Hi = u{) | 1 < i < N, Xi £ val(Xi), U{ £ val(Hi)} the conditional probabilities 
associated with G where val(Xi) is the set of X;'s possible values and val(Hi) denotes 
the set of possible values of the parent nodes 11; as a random vector, respectively. We 
construct a parameterized logic program that defines the same distribution Pq(x\, . . . , xn). 
Our program DBq = Fq U Rq is shown in Figure 13. 

Fq = { msw(par (i ,U{) .once,^) | 1 < i < N, U{ £ v al(Hi), X{ £ val(Xi) } 
R G = { bn(Xi,.. .,Xjv) :-/\^ =1 msw(par(i,n 8 ) ,once,X 8 ) . } 

Figure 13: Bayesian network program DBq 

Fq is comprised of msw atoms of the form msw(par(i,-u;) , once, a;;) whose probability is 
exactly the conditional probability Pg(X; = X{ \ LTj- = u{). When Xi has no parents, U{ is 
the empty list [] . Rq is a singleton, containing only one clause whose body is a conjunction 
of msw atoms which corresponds to the product of conditional probabilities. Note that we 
intentionally identify random variables X\ , . . . , Xjv with logical variables Xi , . . . , Xjv for 
convenience. 

Proposition 5.5 DBq denotes the same distributions as G. 

(Proof) Let (x\, . . . , x^) be a realization of the random vector {X\, . . . ,Xn). It holds by 
construction that 

N 

P D B G (bn(x 1 ,. . .,x N )) = Y[ -Pms H (msw(par(i,-u 8 ) ,once,a; 8 )) 

h=l 

N 

= n p G (Xi = xi | n 8 = Ui ) 

h=l 

= P G (x 1 ,...,x N ). Q.E.D. 

In the case of G\ in Figure 12, the program becomes 57 



bn(A,B,C,D,E,F) :- msw(par( 'A' , [] ) , once, A) , msw(par( 'B' , [] ) ,once,B) , 

msw (par ( 'C , [A] ) ,once,C) , msw (par ( 'D' , [A,B] ) ,once,D) , 
msw (par ( 'E' , [D] ) ,once,E) , msw (par ( 'F' , [D] ) ,once,F) . 



57. 'A', 'b', . . . are Prolog constants used in place of integers. 
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and the left-to-right sampling execution gives a sample realization of the random vector 
( A,B, C,D, E, F ). A marginal distribution is computed from bn(a;i , . . .,x^) by adding a new 
clause to DBq. For example, to compute Po 1 (c,d), we add bn(C,D):- bn(A,B,C,D,E,F) 

to DBq 1 (let the result be DB' G ) and then compute Ppg' (bn(c,<i)) which is equal to 

G i 

Pq 1 (c, d) because 

P DB , Cbn(c,d)) = P DBa (3a,b,e,fbn(a,b,c,d,e,f)) 

G i 1 

= PDB Gl (^(a,b,c,d,e,f)) 

a,b,e,f 

= P Gl (c,d). 

Regrettably this computation corresponds to (22), not to the factorization (23). Efficient 
probability computation using factorization is made possible by carrying out summations 
in a proper order. 

We next sketch by an example how to carry out specified summations in a specified 
order by introducing new clauses. Suppose we have a joint distribution P(x,y,z,w) = 
(f>i(x,y)(f>2(y,z,w)(f>3(x,z,w) such that (f>i(x,y), (f>2(y,z,w) and (f>3(x,z,w) are respectively 
computed by atoms pi(X,Y), p2(Y,Z,W) and p3(X,Z,W). Suppose also that we hope to 
compute the sum 

p ( x ) = J2M x ,y) wZhiy^.w^ix.z.wU 

y \z,w / 

in which we first eliminate z, w and then y. Corresponding to each elimination, we introduce 
two new predicates, q(X,Y) to compute 4>^(x,y) = ^2 Z w (f>2(y,z,w)(f>3(x,z,w) and p(X) to 
compute P(x) = ^2 y (f>i(x , y)(f>4(x , y) as follows. 

p(X) :- pi(X,Y), q(X,Y). 
q(X,Y) :- p 2 (Y,Z,W), p 3 (X,Z,W). 

Note that the clause body of q/2 contains Z and W as (existentially quantified) local variables 
and the clause head q(X,Y) contains variables shared with other atoms. In view of the 
correspondence between an( i 3, it is easy to confirm that this program realizes the 
required computation. It is also easy to see by generalizing this example, though we do 
not prove here, that there exists a parameterized logic program that carries out the given 
summations in the given order for an arbitrary Bayesian network, in particular we are 
able to simulate VE (variable elimination, Zhang & Poole, 1996; D'Ambrosio, 1999) in our 
approach. 

Efficient computation of marginal distributions is not always possible but there is a 
well-known class of Bayesian networks, singly connected Bayesian networks, for which there 
exists an efficient algorithm to compute marginal distributions by message passing (Pearl, 
1988; Castillo et al., 1997). We here show that when the graph is singly connected, we can 
construct an efficient tabled Bayesian network program DBqt assigning a table predicate 
to each node. To avoid complications, we explain the construction procedure informally 
and concentrate on the case where we have only one interested variable. Let G be a singly 
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connected graph. First we pick up a node U whose probability Pg(u) is what we seek. We 
construct a tree G T with the root node U from G, by letting other nodes dangling from U. 
Figure 14 shows how G\ is transformed to a tree when we select node B as the root node. 




T 

Transformed graph G 1 

Figure 14: Transforming G\ to a tree 

Then we examine each node in G T one by one. We add for each node X in the graph a 
corresponding clause to DBqt whose purpose is to visit all nodes connected to X except the 
one that calls X . Suppose we started from the root node U\ in Figure 15 where evidence 
u is given, and have generated clause (24). Now we proceed to an inner node X (U\ calls 
X). In the original graph G, X has parent nodes \U\, V>2, U3} and child nodes {Vi, V2}. U3 
is a topmost node in G. 




Tree G T 
Figure 15: General situation 

For node X in Figure 15, we add clause (25). When it is called from the parent node 
U\ with Ui being ground, we first generate possible values of U2 by calling val_&2(U2), 
and then call call_X_&2 (U2) to visit all nodes connected to X through XJi- U3 is similary 
treated. After visiting all nodes in G connecting to X through the parent nodes U2 and 
t/3 (nodes connected to U\ have already been visited), the value of random variable X is 
determined by sampling the msw atom jointly indexed by 'X' and the values of U\, U2 and 
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t/3. Then we visit X's children, V\ and Vi- For a topmost node U3 in the original graph, 
we add clause (26). 

tbn(Ui) :- msw (par ('Ui ',[]), once, Ui ), call_f/i_X(Ui). (24) 

call_f/i_X(Ui) :- val_t/ 2 (U 2 ), call_X_f/ 2 (U 2 ) , 
val_f/ 3 (U 3 ), call_X_f/ 3 (U 3 ), 
msw(par( 'X' , [Ui ,U 2 ,U 3 ] ) ,once,X) , 

call_X_yi (X) , call_X_V2(X) . (25) 

call_XJ7 3 (U 3 ) :- msw (par ( 'U 3 ',[]), once ,U 3 ) . (26) 

Let DBqt be the final program containing clauses like (24), (25) and (26). Apparently 
DBqt can be constructed in time linear in the number of nodes in the network. Also 
note that successive unfolding (Tamaki & Sato, 1984) of atoms of the form call_. . . (•) 
in the clause bodies that starts from (24) yields a program DB' G similar to the one in 
Figure 13 which contains msw atoms but no call_. . . (-)'s. As DBqt and DB' G define the 
same distribution, 58 it can be proved from Proposition 5.5 that Pq(u) = Pp£' G (bn(M)) = 
Pdb G t (tbn(-u)) holds (details omitted). By the way, in Figure 15 we assume the construction 
starts from the topmost node Ui where the evidence u is given, but this is not necessary. 
Suppose we change to start from the inner node X. In that case, we replace clause (24) 
with call_X_f/i (Ui) :- msw (par ( 'Ui ' , []),once,Ui) just like (26). At the same time we 
replace the head of clause (25) with tbn() and add a goal call_X_t/i (u) to the body 
and so on. For the changed program DB" ' qt , it is rather straightforward to prove that 
Pdb" G t (tbn() ) = Pg{ u ) holds. It is true that the construction of the tabled program 
DBqt shown here is very crude and there is a lot of room for optimization, but it suffices 
to show that a parameterized logic program for a singly connected Bayesian network runs 
in 0(|y|) time where V is the set of nodes. 

To estimate time complexity of OLDT search w.r.t. DBqt, we declare tbn and every 
predicate of the form call_. . . (•) as table predicate and verify the five conditions on the 
applicability of the graphical EM algorithm (details omitted). We now estimate the time 
complexity of OLDT search for the goal <— tbn(^) w.r.t. DBqt , 59 We notice that calls occur 
according to the pre-order scan (parents - the node - children) of the tree G T , and calls 
to call_y_X(-) occur val(Y) times. Each call to call_Y_X(-) invokes calls to the rest of 
nodes, X's parents and X's children in the graph G T except the caller node, with diffrent 
set of variable instantiations, but from the second call on, every call only refers to solutions 
stored in the solution table in 0(1) time. Thus, the number of added computation steps in 

58. Since distribution semantics is based on the least model semantics, and because unfold/fold transforma- 
tion (Tamaki & Sato, 1984) preserves the least Herbrand model of the transformed program, unfold/fold 
transformation applied to parameterized logic programs preserves the denotation of the transformed 
program. 

59. DBqt is further transformed for the OLDT interpreter to collect msw atoms like the case of the HMM 
program. 
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OLDT search by X is bounded from above, by constant 0(val(U l)val(U2)val(U Z)val(X)) 
in the case of Figure 15. As a result OLDT time is proportional to the number of nodes 
in the original graph G (and so is the size of the support graph) provided that the number 
of edges connecting to a node, and that of values of a random variable are bounded from 
above. So we have 

Proposition 5.6 Let G be a singly connected Bayesian network defining distribution Pq, 
V the set of nodes, and DBqt the tabled program derived as above. Suppose the number of 
edges connecting to a node, and that of values of a random variable are bounded from above 
by some constant. Also suppose table access can be done in 0(1) time. Then, OLDT time 
for computing Pg(u) for an observed value u of a random variable U by means of DBqt is 
0(\V\) and so is time per iteration required by the graphical EM algorithm. If there are T 
observations, time complexity is 0(\V\T). 

0(\V\) is the time complexity required to compute a marignal distribution for a singly 
connected Bayesian network by a standard algorithm (Pearl, 1988; Castillo et al., 1997), 
and also is that of the EM algorithm using it. We therefore conclude that the graphical 
EM algorithm is as efficient as a specialzed EM algorithm for singly connected Bayesian 
networks. 60 We must also quickly add that the graphical EM algorithm is applicable to 
arbitrary Bayesian networks, 61 and what Proposition 5.6 says is that an explosion of the 
support graph can be avoided by appropriate programming in the case of singly connected 
Bayesian networks. 

To summarize, the graphical EM algorithm, a single generic EM algorithm, is proved to 
have the same time complexity as specialized EM algorithms, i.e. the Baum- Welch algorithm 
for HMMs, the Inside-Outside algorithm for PCFGs, and the one for singly connected 
Bayesian networks that have been developed independently in each research field. 

Table 1 summarizes the time complexity of EM learning using OLDT search and the 
graphical EM algorithm in the case of one observation. In the first column, "sc-BNs" 
represents singly connected Bayesian networks. The second column shows a program to use. 
DBh is an HMM proram in Subsection 4.7, DB g t a PCFG program in Subsection 5.3 and 
DBqt a transformed Bayesian network program in Subsection 5.5, respectively. OLDT time 
in the third column is time for OLDT search to complete the search of all t-explanations. 
gEM in the fourth column is time in one iteration taken by the graphical EM algorithm 
to update parameters. We use N, M, L and V respectively for the number of states in 
an HMM, the number of nonterminals in a PCFG, the length of an input string and the 
number of nodes in a Bayesian network. The last column is a standard (specialized) EM 
algorithm for each model. 



60. When a marginal distribution of Pa for more than one variable is required, we can construct a similar 
tabled program that computes marginal probabilities still in 0(|^|) time by adding extra-arguments 
that convey other evidence or by embedding other evidnece in the program. 

61. We check the five conditions with DBcj in Figure 13. The uniqueness condition is obvious as sampling 
always uniquely generates a sampled value for each random variable. The finite support condition is 
satisfied because there are only a finite number of random variables and their values. The acyclic support 
condition is immediate because of the acyclicity of Bayesian networks. The t-exclusiveness condition and 
the independent condition are easy to verify. 
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Model 


Program 


OLDT time 


gEM 


Specialized EM 


HMMs 


DB h 


0(N 2 L) 


0(N 2 L) 


Baum- Welch 


PCFGs 


DB g , 


0(M 3 L 3 ) 


0(M 3 L 3 ) 


Inside-Outside 


sc-BNs 


DB G r 


o(\v\) 


o(\v\) 


(Castillo et al., 1997) 


user model 




0(|OLDT tree|) 


0( support graph ) 





Table 1: Time complexity of EM learning by OLDT search and the graphical EM algorithm 



5.6 Modeling Language PRISM 

We have been developing a symbolic-statistical modeling laguage PRISM since 1995 (URL 
= http://mi.cs.titech.ac.jp/prism/) as an implementation of distribution semantics 
(Sato, 1995; Sato & Kameya, 1997; Sato, 1998). The language is intented for modeling 
complex symbolic-statistical phenomena such as discourse interpretation in natural language 
processing and gene inheritance interacting with social rules. As a programming language, 
it looks like an extension of Prolog with new built-in predicates including the msw predicate 
and other special predicates for manipulating msw atoms and their parameters. 

A PRISM program is comprised of three parts, one for directives, one for modeling and 
one for utilities. The directive part contains declarations such as values, telling the system 
what msw atoms will be used in the execution. The modeling part is a set of non-unit definite 
clauses that define the distribution (denotation) of the program by using msw atoms. The 
last part, the utility part, is an arbitary Prolog program which refers to predicates defined 
in the modeling part. We can use in the utility part learn built-in predicate to carry out 
EM learning from observed atoms. 

PRISM provides three modes of execution. The sampling execution correponds to a 
random sampling drawn from the distribution defined by the modeling part. The second 
one computes the probability of a given atom. The third one returns the support set for a 
given goal. These execution modes are available through built-in predicates. 

We must report however that while the implementation of the graphical EM algorithm 
with a simpified OLDT search mechanism has been under way, it is not completed yet. So 
currently, only Prolog search and learn- naive( DB , Q) in Section 4 are available for EM learn- 
ing though we realized, partially, structrure sharing of explanations in the implemention 
of learn-naive(DB,G). Putting computational efficiecy aside however, there is no problem 
in expressing and learning HMMs, PCFGs, pseudo PCSGs, Bayesian networks and other 
probailistic models by the current version. The learning experiments in the next section 
used a parser as a substitute for the OLDT interpreter, and the independently implemented 
graphical EM algorithm. 

6. Learning Experiments 

After complexity analysis of the graphical EM algorithm for popular symbolic-probabilistic 
models in the previous section, we look at an actual behavior of the graphical EM algorithm 
with real data in this section. We conducted learning experiments with PCFGs using two 
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corpora which have contrasting characters, and compared the performance of the graphical 
EM algorithm against that of the Inside-Outside algorithm in terms of time per iteration 
(= time for updating parameters). The results indicate that the graphical EM algorithm 
can outperform the Inside-Outside algorithm by orders of magnigude. Detalis are reported 
by Sato, Kameya, Abe, and Shirai (2001). Before proceeding, we review the Inside-Outside 
algorithm for completeness. 

6.1 The Inside-Outside Algorithm 

The Inside-Outside algorithm was proposed by Baker (1979) as a generalization of the 
Baum- Welch algorithm to PCFGs. The algorithm is designed to estimate parameters for a 
CFG grammar in Chomsky normal form containing rules expressed by numbers like i — ► j,k 
(1 < hjik < N for N nonterminals, where 1 is a starting symbol). Suppose an input 
sentence w\, . . . ,wl is given. In each iteration, it first computes in a bottom up manner 
inside probabilities e(s,t,i) = P{i w s , . . . ,w t ) and then computes outside probabilities 
f(s, t, i) = P(5 w\ , . . . , w s _i i Wt+i , • • • , wl) in a top-down manner for every s, t and 
«(l<s<^<iy,l<i< N). After computing both probabilities, parameters are 
updated by using them, and this process iterates until some predetermined criterion such 
as a convergence of the likelihood of the input sentence is achieved. Although Baker did 
not give any analysis of the Inside-Outside algorithm, Lari and Young (1990) showed that it 
takes 0(N 3 L 3 ) time in one iteration and Lafferty (1993) proved that it is the EM algorithm. 

While it is true that the Inside-Outside algorithm has been recognized as a standard EM 
algortihm for training PCFGs, it is notoriously slow. Although there is not much literature 
explicitly stating time required by the Inside-Outside algorithm (Carroll & Rooth, 1998; 
Beil, Carroll, Prescher, Riezler, & Rooth, 1999), Beil et al. (1999) reported for example 
that when they trained a PCFG with 5,508 rules for a corpus of 450,526 German subordi- 
nate clauses whose average ambiguity is 9,202 trees/clause using four machines (167MHz 
Sun UltraSPARCx2 and 296MHz Sun UltraSPARC-IIx2), it took 2.5 hours to complete 
one iteration. We discuss later why the Inside-Outside algorithm is slow. 

6.2 Learning Experiments Using Two Corpora 

We report here parameter learning of existing PCFGs using two corpora of moderate size 
and compare the graphical EM algorithm against the Inside-Outside algorithm in terms 
of time per iteration. As mentioned before, support graphs, input to the garphical EM 
algorithm, were generated by a parser, i.e. MSLR parser. 62 All measurements were made 
on a 296MHz Sun UltraSPARC-II with 2GB memory under Solaris 2.6 and the threshold 
for an increase of the log likelihood of input sentences was set to 10~ 6 as a stopping criterion 
for the EM algorithms. 

In the experiments, we used ATR corpus and EDR corpus (each converted to a POS 
(part of speech)-tagged corpus). They are similar in size (about 10,000) but contrasting in 
their characters, sentence length and ambiguity of their grammars. The first experiment 
employed ATR corpus which is a Japanese- English corpus (we used only the Japanese part) 
developed by ATR (Uratani, Takezawa, Matsuo, & Morita, 1994). It contains 10,995 short 

62. MSLR parser is a Tomita (Generalized LR) parser developed by Tanaka-Tokunaga Laboratory in Tokyo 
Institute of Technology (Tanaka, Takezawa, & Etoh, 1997). 
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conversational sentences, whose minimum length, average length and maximum length are 
respectively 2, 9.97 and 49. As a skeleton of PCFG, we employed a context free grammar 
(7atr comprising 860 rules (172 nonterminals and 441 terminals) manually developed for 
ATR corpus (Tanaka et al., 1997) which yields 958 parses/sentence. 

Because the Inside-Outside algorithm only accepts a CFG in Chomsky normal form, we 
converted G a tr into Chomsky normal form G^tr" ^atr con t a i ns 2,105 rules (196 nonter- 
minals and 441 terminals). We then divided the corpus into subgroups of similar length 
like (L = 1, 2), (L = 3, 4), . . . , (L = 25, 26), each containing randomly chosen 100 sentences. 
After these preparations, we compare at each length the graphical EM algorithm applied to 
(7atr and <j*^ against the Inside-Outside algorithm applied to <j*^ in terms of time per 
iteration by running them until convergence. 
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Figure 16: Time per iteration : FO vs. gEM (ATR) 



Curves in Figure 16 show the learning results where an x-axis is the length L of an input 
sentence and a y-axis is average time taken by the EM algorithm in one iteration to update 
all parameters contained in the support graphs generated from the chosen 100 sentences 
(other parameters in the grammar do not change). In the left graph, the Inside-Outside 
algorithm plots a cubic curve labeled "I-O". We omitted a curve drawn by the graphical 
EM algorithm as it drew the x-axis. The middle graph magnifies the left graph. The curve 
labeled "gEM (original)" is plotted by the graphical EM algorithm applied to the original 
grammar C a ^ r whereas the one labeled "gEM (Chomsky NF)" used Gt^ . At length 10, the 
average sentence length, it is measured that whichever grammar is employed, the graphical 
EM algorithm runs several hundreds times faster (845 times faster in the case of C a ^ r 
and 720 times faster in the case of <j*^ ) than the Inside-Outside algorithm per iteration. 
The right graph shows (almost) linear dependency of updating time by the graphical EM 
algorithm within the measuared sentence length. 

Although some difference is anticipated in their learning speed, the speed gap between 
the Inside-Outside algorithm and the graphical EM algorithm is unexpectedly large. The 
most conceivable reason is that ATR corpus only contains short sentences and C a ^ r is not 
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much ambiguous so that parse trees are sparse and generated support graphs are small, 
which affects favorably the perforamnce of the graphical EM algorithm. 

We therefore conducted the same experiment with another corpus which contains much 
longer sentences using a more ambiguous grammar that generates dense parse trees. We 
used EDR Japanese corpus (Japan EDR, 1995) containing 220,000 Japanese news article 
sentences. It is however under the process of re- annotation, and only part of it (randomly 
sampled 9,900 sentences) has recently been made available as a labeled corpus. Compared 
with ATR corpus, sentences are much longer (the average length of 9,900 sentences is 20, 
the minimum length 5, the maximum length 63) and a CFG grammar G e( ^ r (2,687 rules, 
converted to Chomsky normal form grammar <j*^ containing 12,798 rules) developed for 
it is very ambiguous (to keep a coverage rate), having 3.0 X 10 8 parses/sentence at length 
20 and 6.7 X 10 19 parses/sentence at length 38. 



(sec) (sec) (sec) 




Figure 17: Time per iteration : FO vs. gEM (EDR) 

Figure 17 shows the obtained curves from the experiments with EDR corpus (the graph- 
ical EM algorithm applied to G e( ^ r vs. the Inside- Outside algorithm applied to C*^) under 
the same condition as ATR corpus, i.e. plotting average time per iteration to process 100 
sentences of the designated length, except that the plotted time for the Inside-Outside al- 
gorithm is the average of 20 iterations whereas that for the graphical EM algorithm is the 
average of 100 iterations. As is clear from the middle graph, this time again, the graphical 
EM algorithm runs orders of magnitude faster than the Inside-Outside algorithm. At aver- 
age sentence length 20, the former takes 0.255 second whereas the latter takes 339 seconds, 
giving a speed ratio of 1,300 to 1. At sentence length 38, the former takes 2.541 seconds but 
the latter takes 4,774 seconds, giving a speed ratio of 1,878 to 1. Thus the speed ratio even 
widens compared to ATR corpus. This can be explained by the mixed effects of 0(L 3 ), 
time complexity of the Inside-Outside algorithm, and a moderate increase in the total size 
of support graphs w.r.t. L. Notice that the right graph shows how the total size of support 
graphs grows with sentence length L as time per iteration by the graphical EM algorithm 
is linear in the total size of support graphs. 
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Since we implemented the Inside- Outside algorithm faithfully to Baker (1979), Lari and 
Young (1990), there is much room for improvement. Actually Kita gave a refined Inside- 
Outside algorithm (Kita, 1999). There is also an implementation by Mark Johnson of the 
Inside-Outside algorithm down-loadable from http://www.cog. brown. edu/'/TEmj /. The 
use of such implementations may lead to different conclusions. We therefore conducted 
learning experiments with the entire ATR corpus using these two implementations and 
measured updating time per iteration (Sato et al., 2001). It turned out that both imple- 
mentations run twice as fast as our naive implementation and take about 630 seconds per 
iteration while the graphical EM algorithm takes 0.661 second per iteration, which is still 
orders of magnitude faster than the former two. Regrettably a similar comparison using 
the entire EDR corpus available at the moment was abandoned due to memory overflow 
during parsing for the construction of support graphs. 

Learning experiments so far only compared time per iteration which ignore extra time 
for search (parsing) required by the graphical EM algorithm. So a question naturally arises 
w.r.t. comparison in terms of total learning time. Assuming 100 iterations for learning 
ATR corpus however, it is estimated that even considering parsing time, the graphical 
EM algorithm combined with MSLR parser runs orders of magnitude faster than the three 
implementations (ours, Kita's and Johnson's) of the Inside-Outside algorithm (Sato et al., 
2001). Of course this estimation does not directly apply to the graphical EM algorithm 
combined with OLDT search, as the OLDT interpreter will take more time than a parser 
and how much more time is needed depends on the implementaiton of OLDT search. 63 
Conversely, however, we may be able to take it as a rough indication of how far our approach, 
the graphical EM algorithm combined with OLDT search via support graphs, can go in the 
domain of EM learning of PCFGs. 

6.3 Examing the Performance Gap 

In the previous subsection, we compared the performance of the graphical EM algorithm 
against the Inside-Outside algorithm when PCFGs are given, using two corpora and three 
implementations of the Inside-Outside algorithm. In all experiments, the graphical EM 
algorithm considerably outperformed the Inside-Outside algorithm despite the fact that 
both have the same time complexity. Now we look into what causes such a performance 
gap. 

Simply put, the Inside-Outside algorithm is slow (primarily) because it lacks parsing. 
Even when a backbone CFG grammar is explicitly given, it does not take any advantage of 
the constraints imposed by the grammar. To see it, it might help to review how the inside 
probability e(s,t,A), i.e. P(nonterminal A spans from s-th word to /-th word) (s < /), is 
calculated by the Inside-Outside algorithm for the given grammar. 

r=t-l 

e(s,t,A)= J2 E ?(A^BC)e(s,r,B)e(r+l,t,C) 

B,c s.t. A^rBG in the grammar r=s 

Here P(A — ► BC) is a probability associated with a production rule A BC. Note that for 
a fixed triplet (s, /, A), it is usual that the term P(A — ► BC)e(s, r, B)e(r+ 1, /, C) is non-zero 

63. We cannnot answer this question right now as the implementation of OLDT search is not completed. 
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only for a relatively small number of (_B,C, r)'s determined from successful parses and the 
rest of combinations always give to the term. Nonetheless the Inside-Outside algorithm 
attempts to compute the term in every iteration for all possible combinations of B, C and 
r and this is repeated for every possible (s, /, A), resulting in a lot of redundancy. The same 
kind of redundancy occurs in the computation of outside probability by the Inside-Outside 
algorithm. 

The graphical EM algorithm is free of such redundancy because it runs on parse trees (a 
parse forest) represented by the support graph. 64 It must be added, on the other hand, that 
superiority in learning speed of the graphical EM algorithm is realized at the cost of space 
complexity because while the Inside-Outside algorithm merely requires 0(NL 2 ) space for 
its array to store probabilities, the graphical EM algorithm needs 0(N 3 L 3 ) space to store 
the support graph where N is the number of nonterminals and L is the sentence length. 
This trade-off is understandable if one notices that the graphical EM algorithm applied 
to a PCFG can be considered as partial evaluation of the Inside-Outside algorithm by the 
grammar (and the introduction of appropriate data structure for the output). 

Finally we remark that the use of parsing as a preprocess for EM learning of PCFGs is 
not unique to the graphical EM algorithm (Fujisaki, Jelinek, Cocke, Black, & Nishino, 1989; 
Stolcke, 1995). These approaches however still seem to contain redundancies compared with 
the graphical EM algorithm. For instance Stolcke (1995) uses an Earley chart to compute 
inside and outside probability, but parses are implicitly reconstructed in each iteration 
dynamically by combining completed items. 

7. Related Work and Discussion 
7.1 Related Work 

The work presented in this paper is at the crossroads of logic programming and probability 
theory, and considering an enormous body of work done in these fields, incompleteness is 
unavoidable when reviewing related work. Having said that, we look at various attempts 
made to integrate probability with computational logic or logic programming. 65 In review- 
ing, one can immediately notice there are two types of usage of probability. One type, 
constraint approach, emphasizes the role of probability as constraints and does not nec- 
essarily seek for a unique probability distribution over logical formulas. The other type, 
distribution approach, explicitly defines a unique distribution by model theoretical means 
or proof theoretical means, to compute various probabilities of propositions. 

A typical constraint approach is seen in the early work of probabilistic logic by Nilsson 
(1986). His central problem, "probabilistic entailment problem", is to compute the upper 
and lower bound of probability P(<^>) of a target sentence (f> in such a way that the bounds 
are compatible with a given knowledge base containing logical sentences (not necessarily 
logic programs) annotated with a probability. These probabilities work as constraints on 



64. We emphasize that the difference between the Inside-Outside afgorithm and the graphicaf EM afgorithm 
is sofefy computationaf efficiency, and they converge to the same parameter vafues when starting from the 
same initiai vafues. Linguistic evafuations of the estimated parameters by the graphicaf EM afgorithm 
are afso reported by Sato et af. (2001). 

65. We omit fiterature feaning strongfy toward fogic. For fogic(s) concerning uncertainty, see an overview 
by Kyburg (1994). 
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the possible range of P(<^>). He used the linear programming technique to solve this problem 
that inevitably delimits the applicability of his approach to finite domains. 

Later Lukasiewicz (1999) investigated the computational complexity of the probabilistic 
entailment problem in a slightly different setting. His knowledge base comprises statements 
of the form (H \ G)[u\,U2\ representing u\ < F(H \ G) < ui- He showed that inferring 
"tight" Mi, U2 is NP-hard in general, and proposed a tractable class of knowledge base called 
conditional constraint trees. 

After the influential work of Nilsson, Frish and Haddawy (1994) introduced a deduc- 
tive system for probabilistic logic that remedies "drawbacks" of Nilsson's approach, that 
of computational intractability and the lack of a proof system. Their system deduces a 
probability range of a proposition by rules of probabilistic inferences about unconditional 
and conditional probabilities. For instance, one of the rules infers P(a \ 8) £ [0 y] from 
P(aV/3 | 8) £ [x y] where a, (3 and 8 are propositional variables and [x y] (x < y) designates 
a probability range. 

Turning to logic programming, probabilistic logic programming formalized by Ng and 
Subrahmanian (1992) and Dekhtyar and Subrahmanian (1997) was also a constraint ap- 
proach. Their program is a set of annotated clauses of the form A : fj, <— F\ : /ii, . . . , F n : fj, n 
where A is an atom, Fi (1 < i < n) a basic formula, i.e. a conjunction or a disjunction of 
atoms, and fj,j (0 < j < n) a sub-interval of [0, 1] indicating a probability range. A query 
<— 3 {F\ : /ii, . . . , F n : fj, n ) is answered by an extension of SLD refutation. On formalization, 
it is assumed that their language contains only a finite number of constant and predicate 
symbols, and no function symbol is allowed. 

A similar framework was proposed by Lakshmanan and Sadri (1994) under the same syn- 
tactic restrictions (finitely many constant and predicate symbols but no function symbols) 
in a different uncertainty setting. They used annotated clauses of the form A <— - B\, . . . , B n 
where A and Bi (1 < i < n) are atoms and c = {[a, /3], [7, 8]), a confidence level, represents 
a belief interval [a, (3] (0 < a < fi < 1) and a doubt interval [7,^] (0 < 7 < 8 < 1), which 
an expert has in the clause. 

As seen above, defining a unique probability distribution is of secondary or no concern 
to the constraint approach. This is in sharp contrast with Bayesian networks as the whole 
discipline rests on the ability of the networks to define a unique probability distribution 
(Pearl, 1988; Castillo et al., 1997). Researchers in Bayesian networks have been seeking for 
a way of mixing Bayesian networks with a logical representation to increase their inherently 
propositional expressive power. 

Breese (1992) used logic programs to automatically build a Bayesian network from a 
query. In Breese's approach, a program is the union of a definite clause program and a set 
of conditional dependencies of the form P(P | Q\ A • • • A Q n ) where P and QjS are atoms. 
Given a query, a Bayesian network is constructed dynamically that connects the query and 
relevant atoms in the program, which in turn defines a local distribution for the connected 
atoms. Logical variables can appear in atoms but no function symbol is allowed. 

Ngo and Haddawy (1997) extended Breese's approach by incorporating a mechanism 
reflecting context. They used a clause of the form P(Ao | A\, . . . , A n ) = a <— L\, . . . , L^, 
where A^s are called p-atoms (probabilistic atoms) whereas Lfs are context atoms disjoint 
from p-atoms, and computed by another general logic program (satisfying certain restric- 
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tions). Given a query, a set of evidence and context atoms, relevant ground p-atoms are 
identified by resolving context atoms away by SLDNF resolution, and a local Bayesian net- 
work is built to calculate the probability of the query. They proved the soundness and 
completeness of their query evaluation procedure under the condition that programs are 
acyclic 66 and domains are finite. 

Instead of defining a local distribution for each query, Poole (1993) defined a global dis- 
tribution in his "probabilistic Horn abduction". His program consists of definite clauses and 
disjoint declarations of the form disjoint ( [hi :pi , . . . ,h n ) which specifies a probabil- 
ity distribution over the hypotheses (abducibles) {hi, . . . ,h n }. He assigned probabilities to 
all ground atoms with the help of the theory of logic programming, and furthermore proved 
that Bayesian networks are representable in his framework. Unlike previous approaches, his 
language contains function symbols, but the acyclicity condition imposed on the programs 
for his semantics to be definable seems to be a severe restriction. Also, probabilities are not 
defined for quantified formulas. 

Bacchus et al. (1996) used a much more powerful first-order probabilistic language than 
clauses annotated with probabilities. Their language allows a statistically quantified term 
such as || 4>(x)\9(x) \\ x to denote the ratio of individuals in a finite domain satisfying (f>(x) A 
9(x) to those satisfying 9(x). Assuming that every world (interpretation for their language) 
is equally likely, they define the probability of a sentence ip under the given knowledge 



possible worlds containing N individuals satisfying x, and r parameters used in judging 
approximations. Although the limit does not necessarily exist and the domain must be finite, 
they showed that their method can cope with difficulties arising from "direct inference" and 
default reasoning. 

In a more linguistic vein, Muggleton (1996, and others) formulated SLPs (stochastic 
logic programs) procedurally, as an extension of PCFGs to probabilistic logic programs. 
So, a clause C, which must be range-restricted, 67 is annotated with a probability p like 
p : C . The probability of a goal G is the product of such ps appearing in its refutation but 
with a modification such that if a subgoal g can invoke n clauses, pi : C\ (1 < i < n) at 
some refutation step, the probability of choosing &-th clause is normalized to Pk/ ^7=1 Pi- 
More recently, Cussens (1999, 2001) enriched SLPs by introducing a special class of 
log-linear models for SLD refutations w.r.t. a given goal. He for example considers all 
possible SLD refutations for the most general goal <— s(X) and defines probability P(-R) 
of a refutation R as P(-R) = Z~ x exp Xif(R, i)). Here A; is a number associated with 
a clause C\ and v(R,i) is a feature, i.e. the number of occurrences of C\ in R. Z is the 
normalizing constant. Then, the probability assigned to s(a) is the sum of probabilities of 
refutation for <— s(a). 



66. The condition says that every ground atom A must be assigned a unique integer n(A) such that n(A) > 
n(Bi), . . . , n(B„) holds for any ground instance of a clause of the form A <— B\, . . . , B n . Under this 
condition, when a program includes p(X) <— q(X, Y), we cannot write recursive clauses about q such as 



67. A syntactic property that variables appearing in the head also appear in the body of a clause. A unit 
clause must be ground. 




where ^worlds]y(x) lS the number of 



g(X, [H\Y]) <— q(X, Y). 
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7.2 Limitations and Potential Problems 

Approaches described so far have more or less similar limitations and potential problems. 
Descriptive power confined to finite domains is the most common limitation, which is due 
to the use of the linear programming technique (Nilsson, 1986), or due to the syntactic 
restrictions not allowing for infinitely many constant, function or predicate symbols (Ng 
& Subrahmanian, 1992; Lakshmanan & Sadri, 1994). Bayesian networks have the same 
limitation as well (only a finite number of random variables are representable). 68 Also there 
are various semantic/syntactic restrictions on logic programs. For instance the acyclicity 
condition imposed by Poole (1993) and Ngo and Haddawy (1997) prevents the unconditional 
use of clauses with local variables, and the range-restrictedness imposed by Muggleton 
(1996) and Cussens (1999) excludes programs such as the usual membership Prolog program. 

There is another type of problem, the possibility of assigning conflicting probabilities 
to logically equivalent formulas. In SLPs, P(A) and P(A A A) do not necessarily coincide 
because A and AAA may have different refutations (Muggleton, 1996; Cussens, 1999, 2001). 
Consequently in SLPs, we would be in trouble if we naively interpret P(A) as the probability 
of A's being true. Also assigning probabilities to arbitrary quantified formulas seems out of 
scope of both approaches to SLPs. 

Last but not least, there is a big problem common to any approach using probabilities: 
where do the numbers come from? Generally speaking, if we use n binary random variables in 
a model, we have to determine 2 n probabilities to completely specify their joint distribution, 
and fulfilling this requirement with reliable numbers quickly becomes impossible as n grows. 
The situation is even worse when there are unobservable variables in the model such as 
possible causes of a disease. Apparently parameter learning from observed data is a natural 
solution to this problem, but parameter learning of logic programs has not been well studied. 

Distribution semantics proposed by Sato (1995) was an attempt to solve these problems 
along the line of the global distribution approach. It defines a distribution (probability 
measure) over the possible interpretations of ground atoms for an arbitrary logic program 
in any first order language and assigns consistent probabilities to all closed formulas. Also 
distribution semantics enabled us to derive an EM algorithm for the parameter learning of 
logic programs for the first time. As it was a naive algorithm however, dealing with large 
problems was difficult when there are exponentially many explanations for an observation 
like HMMs. We believe that the efficiency problem is solved to a large extent by the 
graphical EM algorithm presented in this paper. 

7.3 EM Learning 

Since EM learning is one of the central issues in this paper, we separately mention work 
related to EM learning for symbolic frameworks. Roller and Pfeffer (1997) used in their 
approach to KBMC (knowledge-based model construction) EM learning to estimate pa- 
rameters labeling clauses. They express probabilistic dependencies among events by defi- 
nite clauses annotated with probabilities, similarly to Ngo and Haddawy's (1997) approach, 
and locally build a Bayesian network relevant to the context and evidence as well as the 

68. However, RPMs (recursive probability models) proposed by Pfeffer and Roller (2000) as an extension 
of Bayesian networks allow for infinitely many random variables. They are organized as attributes of 
classes and a probability measure over attribute values is introduced. 
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query. Parameters are learned by applying to the constructed network the specialized EM 
algorithm for Bayesian networks (Castillo et al., 1997). 

Dealing with a PCFG by a statically constructed Bayesian network was proposed Pyna- 
dath and Wellman (1998), and it is possible to combine the EM algorithm with their method 
to estimate parameters in the PCFG. Unfortunately, the constructed network is not singly 
connected, and time complexity of probability computation is potentially exponential in the 
length of an input sentence. 

Closely related to our EM learning is parameter learning of log-linear models. Rie- 
zler (1998) proposed the IM algorithm in his approach to probabilistic constraint pro- 
gramming. The IM algorithm is a general parameter estimation algorithm from incom- 
plete data for log-linear models whose probability function F(x) takes the form F(x) = 
Z~ x exp (^"=1 ^i^iix ')) Po( x ) where (Ai, . . . , X n ) are parameters to be estimated, Vi{x) the 
i-th feature of an observed object x and Z the normalizing constant. Since a feature can 
be any function of x, the log-linear model is highly flexible and includes our distribution 
P msH as a special case of Z = 1. There is a price to pay however; the computational cost 
of Z. It requires a summation over exponentially many terms. To avoid the cost of exact 
computation, approximate computation by a Monte Carlo method is possible. Whichever 
one may choose however, learning time increases compared to the EM algorithm for Z = 1. 

The FAM (failure-adjusted maximization) algorithm proposed by Cussens (2001) is an 
EM algorithm applicable to pure normalized SLPs that may fail. It deals with a special 
class of log-linear models but is more efficient than the IM algorithm. Because the statistical 
framework of the FAM is rather different from distribution semantics, comparison with the 
graphical EM algorithm seems difficult. 

Being slightly tangential to EM learning, Roller et al. (1997) developed a functional 
modeling language defining a probability distribution over symbolic structures in which 
they showed "cashing" of computed results leads to efficient probability computation of 
singly connected Bayesian networks and PCFGs. Their cashing corresponds to the compu- 
tation of inside probability in the Inside-Outside algorithm and the computation of outside 
probability is untouched. 

7.4 Future Directions 

Parameterized logic programs are expected to be a useful modeling tool for complex symbolic- 
statistical phenomena. We have tried various types of modeling, besides stochastic gram- 
mars and Bayesian networks, such as the modeling of gene inheritance in the Kariera tribe 
(White, 1963) where the rules of bi-lateral cross-cousin marriage for four clans interact with 
the rules of genetic inheritance (Sato, 1998). The model was quite interdisciplinary, but 
the flexibility of combining msw atoms by means of definite clauses greatly facilitated the 
modeling process. 

Although satisfying the five conditions in Section 4 

• the uniqueness condition (roughly, one cause yields one effect) 

• the finite support condition (there are a finite number of explanations for one obser- 
vation) 

• the acyclic support condition (explanations must not be cyclic) 
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• the t-exclusiveness condition (explanations must be mutually exclusive) 

• the independence condition (events in an explanation must be independent) 

for the applicability of the graphical EM algorithm seems daunting, our modeling experi- 
ences so far tell us that the modeling principle in Section 4 effectively guides us to successful 
modeling. In return, we can obtain a declarative model described compactly by a high level 
language whose parameters are efficiently learnable by the graphical EM algorithm as shown 
in the preceding section. 

One of the future directions is however to relax some of the applicability conditions, 
especially the uniqueness condition that prohibits a generative model from failure or from 
generating multiple observable events. Although we pointed out in Section 4.4 that the MAR 
condition in Appendix B adapted to our semantics can replace the uniqueness condition and 
validates the use of the graphical EM algorithm even when a complete data does not uniquely 
determine the observed data just like the case of "partially bracketed corpora" (Pereira & 
Schabes, 1992), we feel the need to do more research on this topic. Also investigating 
the role of the acyclicity condition seems theoretically interesting as the acyclicity is often 
related to the learning of logic programs (Arimura, 1997; Reddy & Tadepalli, 1998). 

In this paper we only scratched the surface of individual research fields such as HMMs, 
PCFGs and Bayesian networks. Therefore, there remains much to be done about clarifying 
how experiences in each research field are reflected in the framework of parameterized logic 
programs. For example, we need to clarify the relationship between symbolic approaches 
to Bayesian networks such as SPI (Li, Z. & D'Ambrosio, B., 1994) and our approach. 
Also it is unclear how a compiled approach using the junction tree algorithm for Bayesian 
networks can be incorporated into our approach. Aside from exact methods, approximate 
methods of probability computation specialized for parameterized logic programs must also 
be developed. 

There is also a direction of improving learning ability by introducing priors instead of ML 
estimation to cope with data sparseness. The introduction of basic distributions that make 
probabilistic switches correlated seems worth trying in the near future. It is also important 
to take advantage of the logical nature of our approach to handle uncertainty. For example, 
it is already shown by Sato (2001) that we can learn parameters from negative examples 
such as "the grass is not wet" but the treatment of negative examples in parameterized 
logic programs is still in its infancy. 

Concerning developing complex statistical models based on the "programs as distribu- 
tions" scheme, stochastic natural language processing which exploits semantic information 
seems promising. For instance, unification-based grammars such as HPSGs (Abney, 1997) 
may be a good target beyond PCFGs because they use feature structures logically de- 
scribable, and the ambiguity of feature values seems to be expressible by a probability 
distribution. 

Also building a mathematical basis for logic programs with continuous random variables 
is a challenging research topic. 
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8. Conclusion 

We have proposed a logical/mathematical framework for statistical parameter learning of 
parameterized logic programs, i.e. definite clause programs containing probabilistic facts 
with a parameterized probability distribution. It extends the traditional least Herbrand 
model semantics in logic programming to distribution semantics , possible world semantics 
with a probability distribution over possible worlds (Herbrand interpretations) which is 
unconditionally applicable to arbitrary logic programs including ones for HMMs, PCFGs 
and Bayesian networks. 

We also have presented a new EM algorithm, the graphical EM algorithm in Section 4, 
which learns statistical parameters from observations for a class of parameterized logic pro- 
grams representing a sequential decision process in which each decision is exclusive and 
independent. It works on support graphs, a new data structure specifying a logical relation- 
ship between an observed goal and its explanations, and estimates parameters by computing 
inside and outside probability generalized for logic programs. 

The complexity analysis in Section 5 showed that when OLDT search, a complete tabled 
refutation method for logic programs, is employed for the support graph construction and 
table access is done in 0(1) time, the graphical EM algorithm, despite its generality, has 
the same time complexity as existing EM algorithms, i.e. the Baum- Welch algorithm for 
HMMs, the Inside-Outside algorithm for PCFGs and the one for singly connected Bayesian 
networks that have been developed independently in each research field. In addition, for 
pseudo probabilistic context sensitive grammars with N nonterminals, we showed that the 
graphical EM algorithm runs in time 0(N 4 L 3 ) for a sentence of length L. 

To compare actual performance of the graphical EM algorithm against the Inside- 
Outside algorithm, we conducted learning experiments with PCFGs in Section 6 using two 
real corpora with contrasting characters. One is ATR corpus containing short sentences for 
which the grammar is not much ambiguous (958 parses/sentence), and the other is EDR 
corpus containing long sentences for which the grammar is rather ambiguous (3.0 X 10 8 
at average sentence length 20). In both cases, the graphical EM algorithm outperformed 
the Inside-Outside algorithm by orders of magnitude in terms of time per iteration, which 
suggests the effectiveness of our approach to EM learning by the graphical EM algorithm. 

Since our semantics is not limited to finite domains or finitely many random variables 
but applicable to any logic programs of arbitrary complexity, the graphical EM algorithm is 
expected to give a general yet efficient method of parameter learning for models of complex 
symbolic-statistical phenomena governed by rules and probabilities. 
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Appendix A. Properties of Ppg 

In this appendix, we list some properties of Pbb defined by a parameterized logic program 
DB = F U R in a countable first-order language £. 69 First of all, Pbb assigns consistent 
probabilities 70 to every closed formula <f> in C by 

Pdb(^) d = Pdb(W £SI db \uj |= cf>}) 

while guaranteeing continuity in the sense that 

lim^oo P D B(Hh) A • •• A (f>(t n )) = P DB {Vx<i){x)) 
Um„^ oo P DB (0(*i)V---V0(i„)) = P DB (3x<f>(x)) 

where t\,t2, ■ ■ ■ is an enumeration of ground terms in £. 

The next proposition, Proposition A.l, relates Pbb to the Herbrand model. To prove 
it, we need some terminology. A factor is a closed formula in prenex disjunctive normal 
form Qi ■ ■ ■ Q n M where Qi (1 < i < n) is either an existential quantification or a universal 
quantification and M a matrix. The length of quantifications n is called the rank of the 
factor. Define $ as a set of formulas made out of factors, conjunctions and disjunctions. 
Associate with each formula (f> in $ a multi-set r((f>) of ranks by 

{0 if <f> is a factor with no quantification 

{n} if <f> is a factor with rank n 

r(<j)i) W r{4> 2 ) if 4> = 4>\ V (^2 or <j> = <f>i A 

Here W stands for the union of two multi-sets. For instance {1, 2, 3}l+){2, 3, 4} = {1,2,2,3,3,4}. 
We use the multi-set ordering in the proof of Proposition A.l because the usual induction 
on the complexity of formulas does not work. 

Lemma A.l Let (f> be a boolean formula made out of ground atoms in £. PDBi^ 1 ) = 

P F ({v g n F | M DB {y) |= 4>}). 

(Proof) We have only to prove the lemma about a conjunction of atoms of the form D^ 1 A 
• •• A D x n n (x t G {0,1}, 1 <i<n). 

P DB (D? A---AD?) = P db ({oj e £l DB \ u, \= D X S A ■ ■ ■ A D x n "}) 

= Pdb(D 1 = x 1 ,...,D n = x n ) 

= P F {{v G tt F | M DB {v) |= A ••• A £>*»}) Q.E.D. 
Proposition A.l Let cj) be a closed formula in L. PoBi^ 1 ) = PfH^ G &f \ M F > B (i') |= <f\). 



69. For definitions of Q_p, Pf, Mbb(v), &db, Pdb and others used below, see Section 3. 

70. By consistent, we mean probabilities assigned to logical formulas respect the laws of probability such as 
< P{A) < 1 , P{-iA) = 1 - P(A) and P(A V B) = P(A) + P(B) - P(A A B). 
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(Proof) Recall that a closed formula has an equivalent prenex disjunctive normal form 
that belongs to We prove the proposition for formulas in $ by using induction on the 
multi-set ordering over {r((f>) \ 4> £ If r (<^) = 0? 4> nas no quantification. So the 

proposition is correct by Lemma A.l. Suppose otherwise. Write <f> = G[Q\Q2 ■ ■ ■ Q n F] 
where Q\Q 2 • • • Q n F indicates a single occurrence of a factor in G. 71 We assume Q\ = 3x 
(Q 1 = \/x is similarly treated). We also assume that bound variables are renamed to avoid 
name clash. Then G[3xQ2 ■ ■ ■ Q n F] is equivalent to 3xG[Q2 ■ ■ ■ Q n F] in light of the validity 
of (3xA) A B = 3x(A A B) and (3xA) V B = 3x(A V B) when B contains no free x. 

Pdb(^) = P DB (G[Q 1 Q2---QnF]) 

= P DB (3xG[Q 2 ---Q n F[x]]) 

= lim P DB (G[ Q 2 ■ ■ ■ QnFih}} V • • • V G[Q 2 ■ ■ ■ Q n F[t k ] ]) 

k^co 

= lim P DB (G[ Q 2 ... Q n F[h] V • • • V Q 2 ■ ■ ■ Q n F[t k ] ]) 

k^co 

= lim P F {{v G n F I M DB {v) |= G[ Q 2 ■ ■ ■ Q n F[h] V • • • V Q 2 ■ ■ ■ Q n F[t k ] ]}) 

k^co 

(by induction hypothesis) 
= P F {{v G tt F I M DB {v) |= 3xG[Q 2 ■ ■ ■ Q n F[x}}}) 
= P F {{v G n F I M DB {v) |= <f>}) Q.E.D. 

We next prove a theorem on the iff definition introduced in Section 4. Distribution 
semantics considers the program DB = F U R as a set of infinitely many ground definite 
clauses such that F is a set of facts (with a probability measure P F ) and R a set of rules, 
and no clause head in R appears in F. Put 

head(R) ^ {B \ B appears in R as a clause head}. 

For B G head(R), let B <— W{ (i = 1,2, . . .) be an enumeration of clauses about B in R. 
Define iff (B), the iff (if-and-only-if) form of rules about B in DB 72 by 

iS(B) d = B ^ Wi V W 2 V • • • 

Since M F > B (i') is a least Herbrand model, the following is obvious. 

Lemma A. 2 For B in head(R) and v G £If, M F > B (i') |= iff(B). 

Theorem A.l below is about iff (B). It states that at general level, both sides of the iff 
definition p(x) ^ 3yi(x = t\ A W\) V • • • V 3y n (x = t n A W n ) of p(-) coincide as random 
variables whenever x is instantiated to a ground term. 

Theorem A.l Let ijf(B) = B ^ W 1 VW 2 V • • • be the iff form of rules about B G head(R). 
P DB {iff{B)) = 1 and P DB (B) = P DB (W 1 VW 2 V---). 



71. For an expression E, E[y] means that 7 may occur in the specified positions of E. If 71 V 72 in E[ji V 72] 
indicates a single occurrence of 71 V 72 in a positive boolean formula E, E[ji V 72] = £[71] V £[72] holds. 

72. This definition is different from the usual one (Lloyd, 1984; Doets, 1994) as we are here talking at ground 
level. Wi V W2 V • • • is true if and only if one of the disjuncts is true. 
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(Proof) 



P DB (iS(B)) = P db ({uj G tt DB | u |= P A (W 1 V W 2 V • • •)}) 

+P DB ({w G DB | w |= -.P A -.(Wi V W 2 V • • •)}) 

= lim P DB ({^ G db I u |= 5 A \/ Wi}) 



8 = 1 



+ lim Pdb({^ G db I w |= ^P A -i W Wi}) 

k^co . 

8 = 1 

= lim P F ({z/ G F | M DB (v) |= P A V W 8 }) 

k^co 

8 = 1 

+ lim Pf({/^ G f I M DB {v) \=^B A^\J W 8 }) 

8 = 1 

(Lemma A.l) 
= P F ({z/ G F | M DB {y) |= iff(P)}) 
= Pf(^f) (Lemma A.2) 
= 1 

It follows from P DB (iS(B)) = 1 that 

Pdb(B) = P DB (B A iff (P)) = P DB (W 1 V W 2 V • • •)• Q.E.D. 

We then prove a proposition useful in probability computation. Let iPdb(B) be the 
support set for an atom P introduced in Section 4 (it is the set of all explanations for P). In 
the sequel, P is a ground atom. Write ^db(B) = {Si, S 2 , . . .} and \J ^db(B) = SiVS 2 V- • - 73 
Define a set A B by 

A B d = {uj G tt DB | u |= P ~ V ^db(B)}. 

Proposition A.2 For euen/ P G head(R), Pdb(^b) = 1 Pdb(B) = Pdb{\I ^db(B)). 

(Proof) We first prove Pdb(A-b) = 1 but the proof exactly parallels that of Theorem A.l 
except that Wi V W 2 V • • • is replaced by Si V S 2 V • • • using the fact that P ^ Si V S 2 V • • • 
is true in every least Herbrand model of the form Me> b (v). Then from Pdb(A-b) = 1, we 
have 

Pdb(B) = Pdb(B A (P <-> \J iPdb{B))) 

= Pdb<\/ 1>db{B)). Q.E.D. 

Finally, we show that distribution semantics is a probabilistic extension of the traditional 
least Herbrand model semantics in logic programming by proving Theorem A.2. It says that 
the probability mass is distributed exclusively over possible least Herbrand models. 

Define A as the set of least Herbrand models generated by fixing R and varying a subset 
of F in the program DB = F U R. In symbols, 

73. For a set K = {E\, E2, . . .} of formulas, \J K denotes a (-n infinite) disjunction E\ V E2 V • • • 
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A = {lo £ Qdb I u = Mdb{v) for some v £ fi F }. 

Note that as A is merely a subset of Hdb-, we cannot conclude Pdb(A) = 1 a priori, but the 
next theorem, Theorem A. 2, states Pdb(A) = 1, i.e. distribution semantics distributes the 
probability mass exclusively over A, i.e. possible least Herbrand models. 

To prove the theorem, we need some preparations. Recalling that atoms outside head(R)U 
F have no chance of being proved from DB, we introduce 

A' = f {lo £ Qdb I u |= for every ground atom D £" head(R) U F}. 

For a Herbrand interpretation lo £ Hdb-, ^>\f (G ^f) is the restriction of a; to those atoms 
in F. 

Lemma A. 3 Let lo £ be a Herbrand interpretation. 

lo = Mdb(v) f or some v £ ftp iff ' lo £ A' and lo \= B ^ \J iPdb(B) f or every B £ head(R). 

(Proof) Only-if part is immediate from the property of the least Herbrand model. For 
if-part, suppose lo satisfies the right hand side. We show that lo = Me>b(lo\f). As lo and 
Me>b(lo\f) coincide w.r.t. atoms not in head(R), it is enough to prove that they also give 
the same truth values to atoms in head(R). Take B £ head(R) and write \J ' iPdb(B) = 
Si V 52 V • • • Suppose lo |= B <r+ Si V 52 V • • • Then if lo |= B, we have lo |= Sj for some j, 
thereby lo\f |= Sj, and hence Mbb(lo\f) |= Sj, which implies Mbb(lo\f) |= B. Otherwise 
lo |= -i_B. So lo |= -iSj for every j. It follows that Md£(cj| f ) |= -iP. Since P is arbitrary, 
we conclude that lo and Mbb(lo\f) agree on the truth values assigned to atoms in head(R) 
as well. Q.E.D. 

Theorem A. 2 Pdb(A) = 1. 
(Proof) From Lemma A. 3, we have 

A = {lo £ Qdb | u = Musiy) for some v £ S7 F } 

= A'n n a b- 

Behead(R) 

Pdb(A-b) = 1 by Proposition A. 2. To prove Pdb(A') = 1, let Di, Di, ■ ■ ■ be an enumeration 
of atoms not belonging to head(R) U F. They are not provable from DB = F U R, and 
hence false in every least Herbrand model Mdb(v) { v G &f)- So 

Pdb{M) = Hm Pdb({^ G O db | lo |= -i-Di A • • • A ^P m }) 

m^oo 

= lim Pf({/^ £ &F I M DB {v) 1= -nP>i A • • • A -nP> m }) 

m^oo 
= P F (ft F ) = l. 

Since a countable conjunction of measurable sets of probability measure one has also 
probability measure one, it follows from Pdb(A-b) = 1 f° r every B £ head(R) and Pdb(A') = 
1 that Pdb(A) = 1. Q.E.D. 
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Appendix B. The MAR (missing at random) Condition 

In the original formulation of the EM algorithm by Dempster et al. (1977), it is assumed 
that there exists a many-to-one mapping y = x( x ) from a complete data x to an incomplete 
(observed) data y. In the case of parsing, a; is a parse tree and y is the input sentence and x 
uniquely determines y. In this paper, the uniqueness condition ensures the existence of such 
a many-to-one mapping from explanations to observations. We however sometimes face a 
situation where there is no such many-to-one mapping from complete data to incomplete 
data but nonetheless we wish to apply the EM algorithm. 

This dilemma can be solved by the introduction of a missing-data mechanism which 
makes a complete data incomplete. The missing-data mechanism, m, has a distribution 
gcf,(m | x) parameterized by (f> and y, the observed data, is described as y = Xm( x )- It says 
x becomes incomplete y by m. The correspondence between x and y, i.e. {{x,y) \ 3m(y = 
Xm( x ))} naturally becomes many-to-many. 

Rubin (1976) derived two conditions on (data are missing at random and data are 
observed at random) collectively called the MAR (missing at random) condition, and showed 
that if we assume a missing-data mechanism behind our observations that satisfies the MAR 
condition, we may estimate parameters of the distribution over x by simply applying the 
EM algorithm to y, the observed data. 

We adapt the MAR condition to parameterized logic programs as follows. We keep a 
generative model satisfying the uniqueness condition that outputs goals G such as parse 
trees. We further extend the model by additionally inserting a missing-data mechanism 
m between G and our observation O like O = x m (G) and assume m satisfies the MAR 
condition. Then the extended model has a many-to-many correspondence between expla- 
nations and observations, and generates non-exclusive observations such that P(0 AO') > 
(O / O'), which causes £o P(O) > 1 where P(O) = £ G:3m 0=Xm (G) Pdb(G). Thanks to 
the MAR condition however, we are still allowed to apply the EM algorithm to such non- 
exclusive observations. Put it differently, even if the uniqueness condition is seemingly 
destroyed, the EM algorithm is applicable just by (imaginarily) assuming a missing-data 
mechanism satisfying the MAR condition. 
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